Generating recommendations using adversarial counterfactual learning and evaluation

ABSTRACT

A system including one or more processors and one or more non-transitory computer-readable media storing computing instructions that, when executed on the one or more processors, perform certain acts. The acts can include obtaining training data. The acts also can include training candidate recommendation models and an adversarial exposure model using the training data. The acts additionally can include generating recommendations based on a selected recommendation model of the candidate recommendation models. Other embodiments are described.

TECHNICAL FIELD

This disclosure relates generally to generating recommendations using adversarial counterfactual learning and evaluation.

BACKGROUND

Item recommendations can assist a user when selecting items online. For example, when a user views an anchor item, one or more recommended items can be displayed, which can be items that are similar and/or complementary to the anchor item. Many recommendation models are used conventionally. These recommendation models typically do not account for the underlying exposure mechanism, which can result in suboptimal recommendations.

BRIEF DESCRIPTION OF THE DRAWINGS

To facilitate further description of the embodiments, the following drawings are provided in which:

FIG. 1 illustrates a front elevational view of a computer system that is suitable for implementing an embodiment of the system disclosed in FIG. 3;

FIG. 2 illustrates a representative block diagram of an example of the elements included in the circuit boards inside a chassis of the computer system of FIG. 1;

FIG. 3 illustrates a block diagram of a system that can be employed for generating recommendations using adversarial counterfactual learning and evaluation, according to an embodiment;

FIG. 4 illustrates block diagrams for three different settings of causal inference views;

FIG. 5 illustrates plots that show results from an adversarial training process on the Goodread synthetic data using adversarial counterfactual learning (ACL) (generalized matrix factorization (GMF)/GMF) and ACL (multi-layer perceptron (MLP)/MLP);

FIG. 6 illustrates plots that show results of an adversarial training process on the real Goodread data using ACL (attention-based model (Attn)/Attn) that results in the same pattern for the sequential recommendation setting, and demonstrates the effectiveness of including the outcomes for modeling the exposure mechanism;

FIG. 7 illustrates graphs showing results of sensitivity analysis of hidden factor dimension for the content-based ACL (GMF/GMF) model and the sequential ACL (Attn/Attn) model together with their corresponding baseline models, on the three real-world datasets;

FIG. 8 illustrates graphs showing results of sensitivity analysis on the regularization parameter a for the content-based ACL (GMF/GMF) model and the sequential ACL (Attn/Attn) model for their f_(θ) and g_(ψ) components, on the three real-world datasets;

FIG. 9 includes tables showing unbiased evaluations (using the true exposure) for the baselines and the ACL approach on the semi-synthetic data;

FIG. 10 includes a table showing that the models trained by the ACL approach achieve the best outcome;

FIG. 11 includes a table showing mean-squared error (MSE) to online evaluation results from eight online experiments;

FIG. 12 includes tables showing standard evaluations on the real-world data using the propensity-score models and considering all ACL base models;

FIG. 13 illustrates a block diagram of a system 1300 that can be employed for generating recommendations using adversarial counterfactual learning and evaluation, according to another embodiment; and

FIG. 14 illustrates a flow chart for a method 1400 of generating recommendations using adversarial counterfactual learning and evaluation, according to another embodiment.

For simplicity and clarity of illustration, the drawing figures illustrate the general manner of construction, and descriptions and details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the present disclosure. Additionally, elements in the drawing figures are not necessarily drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve understanding of embodiments of the present disclosure. The same reference numerals in different figures denote the same elements.

The terms “first,” “second,” “third,” “fourth,” and the like in the description and in the claims, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms “include,” and “have,” and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, device, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, system, article, device, or apparatus.

The terms “left,” “right,” “front,” “back,” “top,” “bottom,” “over,” “under,” and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the apparatus, methods, and/or articles of manufacture described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.

The terms “couple,” “coupled,” “couples,” “coupling,” and the like should be broadly understood and refer to connecting two or more elements mechanically and/or otherwise. Two or more electrical elements may be electrically coupled together, but not be mechanically or otherwise coupled together. Coupling may be for any length of time, e.g., permanent or semi-permanent or only for an instant. “Electrical coupling” and the like should be broadly understood and include electrical coupling of all types. The absence of the word “removably,” “removable,” and the like near the word “coupled,” and the like does not mean that the coupling, etc. in question is or is not removable.

As defined herein, two or more elements are “integral” if they are comprised of the same piece of material. As defined herein, two or more elements are “non-integral” if each is comprised of a different piece of material.

As defined herein, “approximately” can, in some embodiments, mean within plus or minus ten percent of the stated value. In other embodiments, “approximately” can mean within plus or minus five percent of the stated value. In further embodiments, “approximately” can mean within plus or minus three percent of the stated value. In yet other embodiments, “approximately” can mean within plus or minus one percent of the stated value.

As defined herein, “real-time” can, in some embodiments, be defined with respect to operations carried out as soon as practically possible upon occurrence of a triggering event. A triggering event can include receipt of data necessary to execute a task or to otherwise process information. Because of delays inherent in transmission and/or in computing speeds, the term “real-time” encompasses operations that occur in “near” real-time or somewhat delayed from a triggering event. In a number of embodiments, “real-time” can mean real-time less a time delay for processing (e.g., determining) and/or transmitting data. The particular time delay can vary depending on the type and/or amount of the data, the processing speeds of the hardware, the transmission capability of the communication hardware, the transmission distance, etc. However, in many embodiments, the time delay can be less than approximately 0.1 second, 0.5 second, one second, two seconds, five seconds, or ten seconds.

DESCRIPTION OF EXAMPLES OF EMBODIMENTS

Turning to the drawings, FIG. 1 illustrates an exemplary embodiment of a computer system 100, all of which or a portion of which can be suitable for (i) implementing part or all of one or more embodiments of the techniques, methods, and systems and/or (ii) implementing and/or operating part or all of one or more embodiments of the non-transitory computer readable media described herein. As an example, a different or separate one of computer system 100 (and its internal components, or one or more elements of computer system 100) can be suitable for implementing part or all of the techniques described herein. Computer system 100 can comprise chassis 102 containing one or more circuit boards (not shown), a Universal Serial Bus (USB) port 112, a Compact Disc Read-Only Memory (CD-ROM) and/or Digital Video Disc (DVD) drive 116, and a hard drive 114. A representative block diagram of the elements included on the circuit boards inside chassis 102 is shown in FIG. 2. A central processing unit (CPU) 210 in FIG. 2 is coupled to a system bus 214 in FIG. 2. In various embodiments, the architecture of CPU 210 can be compliant with any of a variety of commercially distributed architecture families.

Continuing with FIG. 2, system bus 214 also is coupled to memory storage unit 208 that includes both read only memory (ROM) and random access memory (RAM). Non-volatile portions of memory storage unit 208 or the ROM can be encoded with a boot code sequence suitable for restoring computer system 100 (FIG. 1) to a functional state after a system reset. In addition, memory storage unit 208 can include microcode such as a Basic Input-Output System (BIOS). In some examples, the one or more memory storage units of the various embodiments disclosed herein can include memory storage unit 208, a USB-equipped electronic device (e.g., an external memory storage unit (not shown) coupled to universal serial bus (USB) port 112 (FIGS. 1-2)), hard drive 114 (FIGS. 1-2), and/or CD-ROM, DVD, Blu-Ray, or other suitable media, such as media configured to be used in CD-ROM and/or DVD drive 116 (FIGS. 1-2). Non-volatile or non-transitory memory storage unit(s) refer to the portions of the memory storage unit(s) that are non-volatile memory and not a transitory signal. In the same or different examples, the one or more memory storage units of the various embodiments disclosed herein can include an operating system, which can be a software program that manages the hardware and software resources of a computer and/or a computer network. The operating system can perform basic tasks such as, for example, controlling and allocating memory, prioritizing the processing of instructions, controlling input and output devices, facilitating networking, and managing files. Exemplary operating systems can include one or more of the following: (i) Microsoft® Windows® operating system (OS) by Microsoft Corp. of Redmond, Wash., United States of America, (ii) Mac® OS X by Apple Inc. of Cupertino, Calif., United States of America, (iii) UNIX® OS, and (iv) Linux® OS. Further exemplary operating systems can comprise one of the following: (i) the iOS® operating system by Apple Inc. of Cupertino, Calif., United States of America, (ii) the Blackberry® operating system by Research In Motion (RIM) of Waterloo, Ontario, Canada, (iii) the WebOS operating system by LG Electronics of Seoul, South Korea, (iv) the Android™ operating system developed by Google, of Mountain View, Calif., United States of America, or (v) the Windows Mobile™ operating system by Microsoft Corp. of Redmond, Wash., United States of America.

As used herein, “processor” and/or “processing module” means any type of computational circuit, such as but not limited to a microprocessor, a microcontroller, a controller, a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a graphics processor, a digital signal processor, or any other type of processor or processing circuit capable of performing the desired functions. In some examples, the one or more processors of the various embodiments disclosed herein can comprise CPU 210.

In the depicted embodiment of FIG. 2, various I/O devices such as a disk controller 204, a graphics adapter 224, a video controller 202, a keyboard adapter 226, a mouse adapter 206, a network adapter 220, and other I/O devices 222 can be coupled to system bus 214. Keyboard adapter 226 and mouse adapter 206 are coupled to a keyboard 104 (FIGS. 1-2) and a mouse 110 (FIGS. 1-2), respectively, of computer system 100 (FIG. 1). While graphics adapter 224 and video controller 202 are indicated as distinct units in FIG. 2, video controller 202 can be integrated into graphics adapter 224, or vice versa in other embodiments. Video controller 202 is suitable for refreshing a monitor 106 (FIGS. 1-2) to display images on a screen 108 (FIG. 1) of computer system 100 (FIG. 1). Disk controller 204 can control hard drive 114 (FIGS. 1-2), USB port 112 (FIGS. 1-2), and CD-ROM and/or DVD drive 116 (FIGS. 1-2). In other embodiments, distinct units can be used to control each of these devices separately.

In some embodiments, network adapter 220 can comprise and/or be implemented as a WNIC (wireless network interface controller) card (not shown) plugged or coupled to an expansion port (not shown) in computer system 100 (FIG. 1). In other embodiments, the WNIC card can be a wireless network card built into computer system 100 (FIG. 1). A wireless network adapter can be built into computer system 100 (FIG. 1) by having wireless communication capabilities integrated into the motherboard chipset (not shown), or implemented via one or more dedicated wireless communication chips (not shown), connected through a PCI (peripheral component interconnector) or a PCI express bus of computer system 100 (FIG. 1) or USB port 112 (FIG. 1). In other embodiments, network adapter 220 can comprise and/or be implemented as a wired network interface controller card (not shown).

Although many other components of computer system 100 (FIG. 1) are not shown, such components and their interconnection are well known to those of ordinary skill in the art. Accordingly, further details concerning the construction and composition of computer system 100 (FIG. 1) and the circuit boards inside chassis 102 (FIG. 1) are not discussed herein.

When computer system 100 in FIG. 1 is running, program instructions stored on a USB drive in USB port 112, on a CD-ROM or DVD in CD-ROM and/or DVD drive 116, on hard drive 114, or in memory storage unit 208 (FIG. 2) are executed by CPU 210 (FIG. 2). A portion of the program instructions, stored on these devices, can be suitable for carrying out all or at least part of the techniques described herein. In various embodiments, computer system 100 can be reprogrammed with one or more modules, systems, applications, and/or databases, such as those described herein, to convert a general purpose computer to a special purpose computer. For purposes of illustration, programs and other executable program components are shown herein as discrete systems, although it is understood that such programs and components may reside at various times in different storage components of computer system 100, and can be executed by CPU 210. Alternatively, or in addition, the systems and procedures described herein can be implemented in hardware, or a combination of hardware, software, and/or firmware. For example, one or more application specific integrated circuits (ASICs) can be programmed to carry out one or more of the systems and procedures described herein. For example, one or more of the programs and/or executable program components described herein can be implemented in one or more ASICs.

Although computer system 100 is illustrated as a desktop computer in FIG. 1, there can be examples where computer system 100 may take a different form factor while still having functional elements similar to those described for computer system 100. In some embodiments, computer system 100 may comprise a single computer, a single server, or a cluster or collection of computers or servers, or a cloud of computers or servers. Typically, a cluster or collection of servers can be used when the demand on computer system 100 exceeds the reasonable capability of a single server or computer. In certain embodiments, computer system 100 may comprise a portable computer, such as a laptop computer. In certain other embodiments, computer system 100 may comprise a mobile device, such as a smartphone. In certain additional embodiments, computer system 100 may comprise an embedded system.

Turning ahead in the drawings, FIG. 3 illustrates a block diagram of a system 300 that can be employed for generating recommendations using adversarial counterfactual learning and evaluation, according to an embodiment. System 300 is merely exemplary and embodiments of the system are not limited to the embodiments presented herein. The system can be employed in many different embodiments or examples not specifically depicted or described herein. In some embodiments, certain elements, modules, or systems of system 300 can perform various procedures, processes, and/or activities. In other embodiments, the procedures, processes, and/or activities can be performed by other suitable elements, modules, or systems of system 300. In some embodiments, system 300 can include a recommendation system 310 and/or web server 320.

Generally, therefore, system 300 can be implemented with hardware and/or software, as described herein. In some embodiments, part or all of the hardware and/or software can be conventional, while in these or other embodiments, part or all of the hardware and/or software can be customized (e.g., optimized) for implementing part or all of the functionality of system 300 described herein.

Recommendation system 310 and/or web server 320 can each be a computer system, such as computer system 100 (FIG. 1), as described above, and can each be a single computer, a single server, or a cluster or collection of computers or servers, or a cloud of computers or servers. In another embodiment, a single computer system can host recommendation system 310 and/or web server 320. Additional details regarding recommendation system 310 and/or web server 320 are described herein.

In some embodiments, web server 320 can be in data communication through a network 330 with one or more user devices, such as a user device 340. User device 340 can be part of system 300 or external to system 300. Network 330 can be the Internet or another suitable network. In some embodiments, user device 340 can be used by users, such as a user 350. In many embodiments, web server 320 can host one or more websites and/or mobile application servers. For example, web server 320 can host a website, or provide a server that interfaces with an application (e.g., a mobile application), on user device 340, which can allow users (e.g., 350) to browse and/or search for items (e.g., products, grocery items), to add items to an electronic cart, and/or to purchase items, in addition to other suitable activities. In a number of embodiments, web server 320 can interface with recommendation system 310 when a user (e.g., 350) is viewing items in order to recommend items to the user.

In some embodiments, an internal network that is not open to the public can be used for communications between recommendation system 310 and web server 320 within system 300. Accordingly, in some embodiments, recommendation system 310 (and/or the software used by such systems) can refer to a back end of system 300 operated by an operator and/or administrator of system 300, and web server 320 (and/or the software used by such systems) can refer to a front end of system 300, as it can be accessed and/or used by one or more users, such as user 350, using user device 340. In these or other embodiments, the operator and/or administrator of system 300 can manage system 300, the processor(s) of system 300, and/or the memory storage unit(s) of system 300 using the input device(s) and/or display device(s) of system 300.

In certain embodiments, the user devices (e.g., user device 340) can be desktop computers, laptop computers, mobile devices, and/or other endpoint devices used by one or more users (e.g., user 350). A mobile device can refer to a portable electronic device (e.g., an electronic device easily conveyable by hand by a person of average size) with the capability to present audio and/or visual data (e.g., text, images, videos, music, etc.). For example, a mobile device can include at least one of a digital media player, a cellular telephone (e.g., a smartphone), a personal digital assistant, a handheld digital computer device (e.g., a tablet personal computer device), a laptop computer device (e.g., a notebook computer device, a netbook computer device), a wearable user computer device, or another portable computer device with the capability to present audio and/or visual data (e.g., images, videos, music, etc.). Thus, in many examples, a mobile device can include a volume and/or weight sufficiently small as to permit the mobile device to be easily conveyable by hand. For example, in some embodiments, a mobile device can occupy a volume of less than or equal to approximately 1790 cubic centimeters, 2434 cubic centimeters, 2876 cubic centimeters, 4056 cubic centimeters, and/or 5752 cubic centimeters. Further, in these embodiments, a mobile device can weigh less than or equal to 15.6 Newtons, 17.8 Newtons, 22.3 Newtons, 31.2 Newtons, and/or 44.5 Newtons.

Exemplary mobile devices can include (i) an iPod®, iPhone®, iTouch®, iPad®, MacBook® or similar product by Apple Inc. of Cupertino, Calif., United States of America, (ii) a Blackberry® or similar product by Research in Motion (RIM) of Waterloo, Ontario, Canada, (iii) a Lumia® or similar product by the Nokia Corporation of Keilaniemi, Espoo, Finland, and/or (iv) a Galaxy™ or similar product by the Samsung Group of Samsung Town, Seoul, South Korea. Further, in the same or different embodiments, a mobile device can include an electronic device configured to implement one or more of (i) the iPhone® operating system by Apple Inc. of Cupertino, Calif., United States of America, (ii) the Blackberry® operating system by Research In Motion (RIM) of Waterloo, Ontario, Canada, (iii) the Android™ operating system developed by the Open Handset Alliance, or (iv) the Windows Mobile™ operating system by Microsoft Corp. of Redmond, Wash., United States of America.

In many embodiments, recommendation system 310 and/or web server 320 can each include one or more input devices (e.g., one or more keyboards, one or more keypads, one or more pointing devices such as a computer mouse or computer mice, one or more touchscreen displays, a microphone, etc.), and/or can each comprise one or more display devices (e.g., one or more monitors, one or more touch screen displays, projectors, etc.). In these or other embodiments, one or more of the input device(s) can be similar or identical to keyboard 104 (FIG. 1) and/or a mouse 110 (FIG. 1). Further, one or more of the display device(s) can be similar or identical to monitor 106 (FIG. 1) and/or screen 108 (FIG. 1). The input device(s) and the display device(s) can be coupled to recommendation system 310 and/or web server 320 in a wired manner and/or a wireless manner, and the coupling can be direct and/or indirect, as well as locally and/or remotely. As an example of an indirect manner (which may or may not also be a remote manner), a keyboard-video-mouse (KVM) switch can be used to couple the input device(s) and the display device(s) to the processor(s) and/or the memory storage unit(s). In some embodiments, the KVM switch also can be part of recommendation system 310 and/or web server 320. In a similar manner, the processors and/or the non-transitory computer-readable media can be local and/or remote to each other.

Meanwhile, in many embodiments, recommendation system 310 and/or web server 320 also can be configured to communicate with one or more databases, such as a database system 315. The one or more databases can include a product database that contains information about products, items, or SKUs (stock keeping units), for example, among other information, as described below in further detail. The one or more databases can be stored on one or more memory storage units (e.g., non-transitory computer readable media), which can be similar or identical to the one or more memory storage units (e.g., non-transitory computer readable media) described above with respect to computer system 100 (FIG. 1). Also, in some embodiments, for any particular database of the one or more databases, that particular database can be stored on a single memory storage unit or the contents of that particular database can be spread across multiple ones of the memory storage units storing the one or more databases, depending on the size of the particular database and/or the storage capacity of the memory storage units.

The one or more databases can each include a structured (e.g., indexed) collection of data and can be managed by any suitable database management systems configured to define, create, query, organize, update, and manage database(s). Exemplary database management systems can include MySQL (Structured Query Language) Database, PostgreSQL Database, Microsoft SQL Server Database, Oracle Database, SAP (Systems, Applications, & Products) Database, and IBM DB2 Database.

Meanwhile, recommendation system 310, web server 320, and/or the one or more databases can be implemented using any suitable manner of wired and/or wireless communication. Accordingly, system 300 can include any software and/or hardware components configured to implement the wired and/or wireless communication. Further, the wired and/or wireless communication can be implemented using any one or any combination of wired and/or wireless communication network topologies (e.g., ring, line, tree, bus, mesh, star, daisy chain, hybrid, etc.) and/or protocols (e.g., personal area network (PAN) protocol(s), local area network (LAN) protocol(s), wide area network (WAN) protocol(s), cellular network protocol(s), powerline network protocol(s), etc.). Exemplary PAN protocol(s) can include Bluetooth, Zigbee, Wireless Universal Serial Bus (USB), Z-Wave, etc.; exemplary LAN and/or WAN protocol(s) can include Institute of Electrical and Electronic Engineers (IEEE) 802.3 (also known as Ethernet), IEEE 802.11 (also known as WiFi), etc.; and exemplary wireless cellular network protocol(s) can include Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Evolution-Data Optimized (EV-DO), Enhanced Data Rates for GSM Evolution (EDGE), Universal Mobile Telecommunications System (UMTS), Digital Enhanced Cordless Telecommunications (DECT), Digital AMPS (IS-136/Time Division Multiple Access (TDMA)), Integrated Digital Enhanced Network (iDEN), Evolved High-Speed Packet Access (HSPA+), Long-Term Evolution (LTE), WiMAX, etc. The specific communication software and/or hardware implemented can depend on the network topologies and/or protocols implemented, and vice versa. 
In many embodiments, exemplary communication hardware can include wired communication hardware including, for example, one or more data buses, such as, for example, universal serial bus(es), one or more networking cables, such as, for example, coaxial cable(s), optical fiber cable(s), and/or twisted pair cable(s), any other suitable data cable, etc. Further exemplary communication hardware can include wireless communication hardware including, for example, one or more radio transceivers, one or more infrared transceivers, etc. Additional exemplary communication hardware can include one or more networking components (e.g., modulator-demodulator components, gateway components, etc.).

In many embodiments, recommendation system 310 can include a communication system 311, a training system 312, an evaluation system 313, a real-time serving system 314, and/or database system 315. In many embodiments, the systems of recommendation system 310 can be modules of computing instructions (e.g., software modules) stored at non-transitory computer readable media that operate on one or more processors. In other embodiments, the systems of recommendation system 310 can be implemented in hardware. Recommendation system 310 and/or web server 320 each can be a computer system, such as computer system 100 (FIG. 1), as described above, and can be a single computer, a single server, or a cluster or collection of computers or servers, or a cloud of computers or servers. In another embodiment, a single computer system can host recommendation system 310 and/or web server 320. Additional details regarding recommendation system 310 and the components thereof are described herein.

In many embodiments, system 300 can provide item recommendations to a user (e.g., a customer) based on an anchor item that the user has selected to view, is about to view, and/or is viewing. For example, when a user selects an item (e.g., a product) to view from a list of products, an item page for the item can display information about the item. This item can be considered the anchor item. The item page also can display information about other items related to the anchor item. These other items can be the item recommendations, and they can be items that are similar and/or complementary to the anchor item. If a user is interested in an anchor item, but it is not exactly what the user wants, a similar item might be what the user wants (e.g., butter and margarine, or butter and shortening). By contrast, complementary items are items that are different (often in different categories), but are often purchased together (e.g., hot dogs and hot dog buns, or hot dogs and ketchup).

Conventionally, item recommendations are shown to users, but a user is exposed to a very small subset of the total number of items available. Moreover, a user's interest can be affected by the items that the user has been shown. For example, if a user has been shown three items and is interested in a first item of the three items, the user's interest in the first item can be based on having seen those three items. For a fourth item that has not been shown to the user, it can be difficult to know if the user will be interested in the fourth item. Moreover, the extent of a user's exposure to items is often unknown, as a user can be exposed to items outside of web server 320, such as through television advertisements, seeing items at a physical store, seeing items on different websites, or talking to other people about items. As such, it can be difficult to know if a user is interested in an item or if the user was exposed to the item elsewhere and is looking for more information about the item. A click model is sometimes used to address this issue, but recommendations from such models often assume the user's interest, which can be an incorrect assumption.

The feedback data of recommender systems often depend on what was exposed to the users; however, most learning and evaluation methods do not account for the underlying exposure mechanism. Applying supervised learning to detect user preferences can end up with inconsistent results in the absence of exposure information. The counterfactual propensity-weighting approach from causal inference can account for the exposure mechanism; nevertheless, the partial-observation nature of the feedback data can cause identifiability issues. In a number of embodiments, system 300 can use a minimax empirical risk formulation. The relaxation of the dual problem can be converted to an adversarial game between two recommendation models, in which the opponent of the candidate model characterizes the underlying exposure mechanism. Learning bounds can be provided, and simulation studies illustrate and justify the techniques described herein over a broad range of recommendation settings, which can offer insight into the various benefits of the techniques described herein.
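
For illustration only, the propensity-weighted risk and the adversarial game described above can be sketched in the following notation. The symbols f_θ and g_ψ are taken here to denote the candidate recommendation model and the adversarial exposure model, respectively. The loss ℓ, the exposure indicator o_i, the propensity p, the link function σ (mapping a score into the interval (0, 1)), the penalty Ω, and the weight α are assumed notation introduced for this sketch, and the expressions are one plausible reading rather than a definitive statement of the formulation.

```latex
% Propensity-weighted empirical risk, usable when the exposure propensity p is known
\hat{R}_{\mathrm{IPW}}(f_\theta)
  = \frac{1}{n} \sum_{i=1}^{n}
    \frac{o_i \,\ell\bigl(f_\theta(x_i),\, y_i\bigr)}{p(o_i = 1 \mid x_i)}
% Assumed minimax relaxation: the adversary g_\psi stands in for the unknown propensity
\min_{\theta} \; \max_{\psi} \;
  \frac{1}{n} \sum_{i=1}^{n}
    \frac{o_i \,\ell\bigl(f_\theta(x_i),\, y_i\bigr)}{\sigma\bigl(g_\psi(x_i)\bigr)}
  \;-\; \alpha \, \Omega(g_\psi)
```

In this sketch, the first expression reweights each observed loss by the inverse of its exposure probability. The second expression replaces the unknown probability with an exposure model chosen adversarially, with the assumed penalty term limiting how far that exposure model can move.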

In the offline learning and evaluation of recommender systems, the dependency of feedback data on the underlying exposure mechanism is often overlooked. When the users express their preferences on the products explicitly (such as providing ratings) or implicitly (such as clicking), the feedback is conditioned on the products to which they are exposed. In most cases, the previous exposures are decided by some underlying mechanism such as the historical recommender system. The dependency causes two dilemmas for machine learning in recommender systems, and satisfactory solutions have yet to be found. Firstly, the majority of supervised learning models handle merely the dependency between label (user feedback) and features, yet in the actual feedback data, the exposure mechanism can alter the dependency pathways, as shown in FIG. 4 and described below.

From a theoretical perspective, directly applying supervised learning on feedback data can result in inconsistent detection of the user preferences. Secondly, an unbiased model evaluation can have the product exposure determined by the candidate recommendation model, which is almost never satisfied when merely using the feedback data. The second dilemma also reveals a gap between evaluating models by online experiments and using historical data, because the offline evaluations can be more likely to be biased toward the historical exposure mechanism, as that mechanism decided the products on which the users could express their preferences. The disagreement between the online and offline evaluations may partly explain the controversial observations made in several recent papers, in which deep recommendation models are outperformed by classical collaborative filtering approaches in offline evaluations, despite their many successful deployments in real-world applications.

In a number of embodiments, to address the above dilemmas for recommender systems, the idea of counterfactual modeling can be used to redesign the learning and evaluation methods. Counterfactual modeling can answer questions related to “what if”, e.g., what is the feedback data if the candidate model were deployed. The counterfactual methods can take account of the dependency between the feedback data and exposure. Conventional attempts have relied on excessive data or model assumptions, such as the missing-data model described below, which may not be satisfied in practice. Many of the assumptions can be essentially unavoidable due to a fundamental discrepancy between the recommender system and observational studies. In observational studies, the exposure (treatment) status can be fully observed, and the exposure mechanism can be completely decided by the covariates (features). For recommender systems, the exposure can be partially captured by the feedback data. The complete exposure status can be retrieved from the system's backend log, to which access can be highly restricted, and such access rarely exists for the public datasets. Also, the exposure mechanism can depend on intractable randomness, e.g., burst events, special offers, interference with other modules such as the advertisement, as well as the relevant features that are not attainable from feedback data.

Turning ahead in the drawings, FIG. 4 illustrates block diagrams for three different settings of causal inference views. For example, in a setting 410, feedback 403 can be based on exposure 402, user features 401, item features 404, and, in some cases, user preferences 405. In a setting 420, exposure 402 can be based on user features 401 and item features 404, and feedback 403 can be based on exposure 402, and in some cases, on user preferences 405, user features 401, and/or item features 404. In a setting 430, model 406 can be based on user features 401 and item features 404, and exposure 402 can be based on model 406, and in some cases, on other factors 407. Feedback 403 can be based on exposure 402, and in some cases, on user preferences 405, user features 401, and/or item features 404. A direct consequence of the above differences between settings 410, 420, and 430 can be that the exposure mechanism is not identifiable from feedback data, i.e., the conditional distribution characterized by the exposure mechanism can be modified without disturbing the observation distribution. Therefore, the conventional methods can make problem-specific and/or unjustifiable assumptions in order to bypass or simply ignore the identifiability issue.

In a number of embodiments, the techniques described herein can acknowledge the uncertainty brought by the identifiability issue and treat it as an adversarial component. A minimax setting can be used in which the candidate model can be optimized over the worst-case exposure mechanism. By applying duality arguments and relaxations, the minimax problem can be converted to an adversarial game between two recommendation models. This approach is novel and principled, and can advantageously provide a theoretical analysis that shows the inconsistency of supervised learning on recommender systems, which is caused by the unknown exposure mechanism. A minimax setting for counterfactual recommendation can beneficially be used and converted to a tractable two-model adversarial game. The generalization bounds for the adversarial learning described herein are shown, with analysis for the minimax optimization. Simulation and real data experiments demonstrate performance benefits of the techniques described herein.

Bold-faced letters are used to denote vectors and matrices, upper-case letters to denote random variables and the corresponding lower-case letters to denote observations. Distributions are denoted by P and Q. Let x_(u) be the user feature vector for user u ∈ {1, . . . , n}, z_(i) be the item feature vector for item i ∈ {1, . . . , m}, O_(u,i) ∈ {0,1} be the exposure status, Y_(u,i) be the feedback, and D be the collected user-item pairs where non-positive interactions may come from negative sampling. The feature vectors can be one-hot encoding or embedding, so this approach can be fully compatible with deep learning models that leverage representation learning and are trained under negative sampling. Recommendation models are denoted by symbols such as f_(θ) and g_(ψ). They take x_(u), z_(i) (and the exposure O_(u,i) if available) as input. The shorthand f_(θ)(u, i) is used to denote the output score, and the loss with respect to Y_(u,i) is given by δ(Y_(u,i), f_(θ)(u, i)). These notations also apply to the sequential recommendation by encoding the previously-interacted items to the user feature vector x_(u).

p_(g)(O_(u,i)|x_(u), z_(i)) is used to denote the exposure mechanism that depends on the underlying model g. Also, p(Y_(u,i)|O_(u,i), x_(u), z_(i)) gives the user response, which is independent from the exposure mechanism whenever O_(u,i) is observed. The stochasticity in the exposure can also be induced by the exogenous factors (unobserved confounders), which bring extra random perturbations. Explicit and implicit feedback settings are not explicitly differentiated unless specified.

Supervised Learning for Feedback Data

Let Y_(u,i) ∈ {−1, 1} be the implicit feedback. Setting aside the exposure for a moment, the goal of supervised learning is to determine the optimal recommendation function that minimizes the surrogate loss:

${{\ell_{\varnothing}\left( f_{\theta} \right)} = {\frac{1}{|D|}{\sum_{{({u,i})} \in D}{\varnothing\left( {Y_{u,i} \cdot {f_{\theta}\left( {u,i} \right)}} \right)}}}},$

where ∅ induces the widely-adopted margin-based loss. Now take account of the (unobserved) exposure status by first letting:

p ⁽¹⁾(o)=p(Y _(u,i)=1, O _(u,i)=o, x _(u) , z _(i)),

p ⁽⁻¹⁾(o)=p(Y _(u,i)=−1, O _(u,i) =o, x _(u) , z _(i)), o ∈{0,1},

to denote the joint distribution for positive and negative feedback under either exposure status. The surrogate loss, which now depends on p⁽¹⁾ and p⁽⁻¹⁾ due to including the exposure, is denoted by L_(∅) (f_(θ), {p⁽¹⁾, p⁽⁻¹⁾}). In assertion 1, it is shown that if the exposure mechanism is fixed and f_(θ) is optimized, the optimal loss and the corresponding f*_(θ) depend merely on p⁽¹⁾ and p⁽⁻¹⁾.

Assertion 1. When the exposure mechanism p(O_(u,i)|X_(u), Z_(i)) is given and fixed, the optimal loss is:

$\begin{matrix} {{\inf\limits_{f_{\theta}}{L_{\varnothing}\left( {f_{\theta},\left\{ {p^{(1)},p^{({- 1})}} \right\}} \right)}} = {- {D_{c}\left( P^{(1)} \parallel P^{({- 1})} \right)}},} & (1) \end{matrix}$

where P⁽¹⁾ and P⁽⁻¹⁾ are the corresponding distributions for p⁽¹⁾ and p⁽⁻¹⁾, and

$D_{c}\left( P^{(1)} \parallel P^{({- 1})} \right) = {\int{{c\left( \frac{p^{(1)}}{p^{({- 1})}} \right)}{dP}^{({- 1})}}}$

is the f-divergence induced by the convex, lower-semicontinuous function c. Also, the optimal f*_(θ) that achieves the infimum is given by

$\alpha_{\theta}^{*}\left( \frac{p^{(1)}}{p^{({- 1})}} \right)$

for some function α*_(θ) that depends on ∅.

The proof of Assertion 1 is provided as follows:

Proof. When taking the exposure mechanism into account, minimizing the loss over f_(θ) is implicitly computing inf_(f_(θ)) L_(∅)(f_(θ), {p⁽¹⁾, p⁽⁻¹⁾}), where

${L_{\varnothing}\left( {f_{\theta},\left\{ {p^{(1)},p^{({- 1})}} \right\}} \right)} = {{\mathbb{E}}\left\lbrack {\varnothing\left( {Y \cdot {f_{\theta}\left( {x,{z;O}} \right)}} \right)} \right\rbrack} = {\sum\limits_{o \in {\{{0,1}\}}}\left\{ {{\varnothing\left( {f_{\theta}\left( {x,{z;{O = o}}} \right)} \right)}{p^{(1)}(o)}} + {{\varnothing\left( {- {f_{\theta}\left( {x,{z;{O = o}}} \right)}} \right)}{p^{({- 1})}(o)}} \right\}}.$

For any fixed exposure mechanism p(O|x, z), there is

$\begin{matrix} {{\inf\limits_{f_{\theta}}{L_{\varnothing}\left( {f_{\theta},\left\{ {p^{(1)},p^{({- 1})}} \right\}} \right)}} = {\sum\limits_{o \in {\{{0,1}\}}}{\inf\limits_{\alpha}\left\{ {{{\varnothing(\alpha)}{p^{(1)}(o)}} + {{\varnothing\left( {- \alpha} \right)}{p^{({- 1})}(o)}}} \right\}}} = {\sum\limits_{o \in {\{{0,1}\}}}{{p^{(1)}(o)}\inf\limits_{\alpha}\left\{ {{\varnothing(\alpha)} + {{\varnothing\left( {- \alpha} \right)}\frac{p^{({- 1})}(o)}{p^{(1)}(o)}}} \right\}}}.} & \left( {A{.1}} \right) \end{matrix}$

For each o ∈ {0, 1}, let μ(o)=p⁽⁻¹⁾(o)/p⁽¹⁾(o) and Δ(μ)=−inf_(α)(∅(α)+∅(−α)μ).

Notice that Δ(μ) is a convex function of μ since the supremum (negative of the infimum) over a set of affine functions is convex. Since Δ is convex and continuous:

${{\inf\limits_{f_{\theta}}{L_{\varnothing}\left( {f_{\theta},\left\{ {p^{(1)},p^{({- 1})}} \right\}} \right)}} = {{- \begin{matrix} \sum \\ {o \in \left\{ {0,1} \right\}} \end{matrix}}{p^{(1)}(o)}{\Delta\left( \frac{p^{({- 1})}(o)}{p^{(1)}(o)} \right)}}},$

which is exactly the f-divergence D_(Δ)(P⁽¹⁾∥P⁽⁻¹⁾) induced by Δ.

Also, upon achieving the infimum in (A.1), the optimal f_(θ) is given by solving α_(∅)*(μ)=arg min_(α)(∅(α)+∅(−α)μ).

Notice that the joint distribution can be factorized into: p(Y_(u,i), o_(u,i), x_(u), z_(i)) ∝ p(Y_(u,i)|o_(u,i), x_(u), z_(i))·p_(g)(o_(u,i)|x_(u), z_(i)), so Assertion 1 implies that:

f _(θ)*(x _(u) , z _(i) ; o _(u,i))=α_(∅)*(p(Y _(u,i)=1|o _(u,i) , x _(u) , z _(i))/p(Y _(u,i)=−1|o _(u,i) , x _(u) , z _(i))).

In conclusion: (1) when the exposure mechanism is given, the optimal loss −D_(c)(P⁽¹⁾∥P⁽⁻¹⁾) is a function of both the user preference and the exposure mechanism; (2) the optimal model f*_(θ)depends merely on the user preference, because f*_(θ) is a function of p(Y|o,x,z) which does not depend on the exposure mechanism (mentioned at the beginning of this section). Both conclusions are practically reasonable, as the optimal recommendation model should detect user preference regardless of the exposure mechanisms. The optimal loss, on the other hand, depends on the joint distribution in which the underlying exposure mechanism plays a part.

However, when p(O_(u,i)|X_(u), Z_(i)) is unknown, the conclusions from Assertion 1 no longer hold, and the optimal f*_(θ) will depend on the exposure mechanism. As a consequence, if the same feedback data were collected under different exposure mechanisms, the recommendation model may find the user preference differently. The inconsistency is caused by not accounting for the unknown exposure mechanism from the supervised learning.
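As a numerical illustration of Assertion 1, the following minimal sketch assumes the logistic loss ∅(t)=log(1+e^(−t)) as the margin-based surrogate, for which the pointwise minimizer has the closed form α*=log(p⁽¹⁾/p⁽⁻¹⁾). The probability values and variable names are hypothetical and chosen only for illustration. The sketch checks the closed form against a brute-force grid search and shows that rescaling both joint masses by a common factor, which mimics a different exposure probability with the same preference ratio, leaves the minimizer unchanged.

```python
import numpy as np

def logistic_loss(t):
    """Assumed margin-based surrogate: phi(t) = log(1 + exp(-t))."""
    return np.log1p(np.exp(-t))

def pointwise_risk(alpha, p_pos, p_neg):
    """phi(alpha) * p^(1)(o) + phi(-alpha) * p^(-1)(o) for one exposure status o."""
    return logistic_loss(alpha) * p_pos + logistic_loss(-alpha) * p_neg

# Hypothetical joint masses p^(1)(o) and p^(-1)(o) at a single (x, z, o).
p_pos, p_neg = 0.3, 0.1

# Brute-force minimization over a fine grid of candidate scores alpha.
grid = np.linspace(-5.0, 5.0, 200001)
alpha_grid = grid[np.argmin(pointwise_risk(grid, p_pos, p_neg))]

# Closed form for the logistic loss: alpha* = log(p^(1) / p^(-1)).
alpha_closed = np.log(p_pos / p_neg)

# Halving both masses keeps the ratio p^(1) / p^(-1), so the minimizer
# stays the same even though the minimal loss value changes.
alpha_scaled = grid[np.argmin(pointwise_risk(grid, 0.5 * p_pos, 0.5 * p_neg))]

print(alpha_grid, alpha_closed, alpha_scaled)
```

The optimal score depends only on the ratio of the two joint masses, while the minimal loss value scales with them, which mirrors the two conclusions drawn from Assertion 1.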

The Propensity-Weighting Approach

In causal inference, the probability of exposure given the observed features (covariates) is referred to as the propensity score. The propensity-weighting approach uses weights based on the propensity score to create a synthetic sample in which the distribution of observed features is independent of exposure. This approach can be beneficial to make the feedback data independent of the exposure mechanism. The propensity-weighted loss is constructed via:

$\frac{1}{D}\Sigma_{{({u,i})} \in D}\varnothing\frac{\left( {y_{u,i} \cdot {f_{\theta}\left( {x_{u},z_{i}} \right)}} \right)}{p\left( {\left. O_{u,i} \middle| X_{u} \right.,Z_{i}} \right)}$

and by taking the expectation with respect to exposure (whose distribution is denoted by Q), the ordinary loss recovered is:

$\begin{matrix} {{{{\mathbb{E}}_{Q}\left\lbrack {\frac{1}{D}{\sum_{{({u,i})} \in D}\frac{\varnothing\left( {y_{u,i} \cdot {f_{\theta}\left( {x_{u},z_{i}} \right)}} \right)}{p\left( {{O_{u,i} = \left. 1 \middle| x_{u} \right.},z_{i}} \right)}}} \right\rbrack} = {{{\mathbb{E}}\;{p_{n}\left\lbrack {\frac{\varnothing\left( {Y \cdot {f_{\theta}\left( {X,Z} \right)}} \right)}{p\left( {{O = \left. 1 \middle| X \right.},Z} \right)}{p\left( {{O = \left. 1 \middle| X \right.},Z} \right)}} \right\rbrack}} = {\ell_{\varnothing}\left( f_{\theta} \right)}}},} & (2) \end{matrix}$

where the second expectation is taken with respect to the empirical distribution P_(n). Let Q_(o) be the distribution for the underlying exposure mechanism. The propensity-weighted empirical distribution is then given by P_(n)/Q_(o) (after scaling), which can be thought of as the synthetic sample distribution after eliminating the influence from the underlying exposure mechanism. It is straightforward to verify that after scaling, the expected propensity-weighted loss is exactly given by:

𝔼_(P_(n)/Q_(o))[∅(Y·f_(θ)(X, Z))].
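The propensity-weighting construction can be sketched on synthetic data. The example below is a minimal illustration under stated assumptions: the data-generating process, the fixed linear scoring model, the logistic loss, and the known propensity function are all hypothetical, and in practice the propensity is not observed. It compares the loss over all user-item pairs with the inverse-propensity-weighted loss computed only from exposed pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_loss(t):
    """Assumed margin-based surrogate phi(t) = log(1 + exp(-t))."""
    return np.log1p(np.exp(-t))

# Hypothetical synthetic data: one scalar feature per user-item pair.
x = rng.normal(size=n)
y = np.where(rng.random(n) < sigmoid(x - 1.0), 1, -1)  # feedback in {-1, 1}
scores = 0.5 * x                                       # output of a fixed model f_theta

# Assumed known exposure propensity p(O=1 | x), bounded away from zero.
propensity = np.clip(sigmoid(2.0 * x), 0.1, 1.0)
exposed = rng.random(n) < propensity

per_pair_loss = logistic_loss(y * scores)

full_loss = per_pair_loss.mean()            # loss if every pair were observed
naive_loss = per_pair_loss[exposed].mean()  # exposed pairs only, no weighting
# Inverse-propensity weighting: each exposed pair is weighted by 1 / p(O=1 | x).
ipw_loss = np.sum(per_pair_loss[exposed] / propensity[exposed]) / n

print(full_loss, naive_loss, ipw_loss)
```

Under these assumptions the weighted estimate tracks the full loss closely, whereas the unweighted average over exposed pairs can drift because exposure favors some feature values over others.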

The Hidden Assumption of the Missing-Data (Click) Model

Conventional approaches known as the “click model” deal with the unidentifiable exposure mechanism by assuming a missing-data model:

p(click=1|x)=p(expose=1|x)·p(relevance=1|x).  (3)

While the click model greatly simplifies the problem because the exposure mechanism can now be characterized explicitly, it relies on a hidden assumption that is rarely satisfied in practice. Use R to denote the relevance and Y to denote the click. The fact that Y=1 ⇔O=1 and R=1 implies:

$p\left( {Y = 1 \mid x} \right) = p\left( {O = 1,R = 1 \mid x} \right) = {p\left( {O = 1 \mid x} \right) \cdot p\left( {R = 1 \mid O = 1,x} \right)}\overset{(3)}{\Rightarrow}p\left( {R = 1 \mid O = 1,x} \right) = p\left( {R = 1 \mid x} \right),$

which suggests that being relevant is independent of getting exposed given the features. This is rarely true (or at least cannot be examined) in many real-world problems, unless x contains every single factor that may affect the exposure and user preference. By contrast, in many embodiments, the techniques described herein can provide a robust solution when the hidden assumption of the missing-data (click) model is dubious or violated.
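A small worked example can make the hidden assumption concrete. The joint probabilities below are hypothetical and describe a single feature value x for which exposure and relevance are positively dependent. In that case the true click probability differs from the product on the right-hand side of (3).

```python
# Hypothetical joint probabilities p(O=o, R=r | x) for one fixed feature x.
joint = {
    (1, 1): 0.30,  # exposed and relevant
    (1, 0): 0.10,  # exposed, not relevant
    (0, 1): 0.10,  # not exposed, relevant
    (0, 0): 0.50,  # not exposed, not relevant
}

p_expose = joint[(1, 1)] + joint[(1, 0)]        # p(O=1 | x)
p_relevant = joint[(1, 1)] + joint[(0, 1)]      # p(R=1 | x)
p_click = joint[(1, 1)]                         # Y=1 exactly when O=1 and R=1
p_rel_given_exposed = joint[(1, 1)] / p_expose  # p(R=1 | O=1, x)

# Right-hand side of (3), which assumes p(R=1 | O=1, x) = p(R=1 | x).
click_model_prediction = p_expose * p_relevant

print(p_click, click_model_prediction, p_rel_given_exposed, p_relevant)
```

Here the exposed items are more likely to be relevant than an arbitrary item, so the factorization in (3) underestimates the click probability.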

Method

Let P* be the ideal exposure-eliminated sample distribution corresponding to P/Q_(o), according to the underlying exposure mechanism Q_(o) and data distribution P. For notational simplicity, and without overloading the original meaning too much, from this point P, P_(n), Q_(o) and P* can be treated as distributions on the sample space X, which includes all the observed data (x_(u), z_(i), y_(u,i)) with (u, i) ∈ D. Since no data or model assumptions are made that would allow P* to be recovered accurately, a minimax formulation is introduced to characterize the uncertainty and optimize f_(θ) against the worst possible choice of (a hypothetical) {circumflex over (P)}, whose discrepancy with the ideal P* can be restricted by the data to a neighborhood: Dist(P*, {circumflex over (P)}) < ρ. Among the divergence and distribution distance measures, the Wasserstein distance can be chosen for this problem, which is defined as:

$\begin{matrix} {{W_{c}\left( {\hat{P},P^{*}} \right)} = {\inf\limits_{\gamma \in {\Pi\left( {\hat{P},P^{*}} \right)}}{{\mathbb{E}}_{{({{({x,z,y})},{({x^{\prime},z^{\prime},y^{\prime}})}})} \sim \gamma}\left\lbrack {c\left( {\left( {x,z,y} \right),\left( {x^{\prime},z^{\prime},y^{\prime}} \right)} \right)} \right\rbrack}},} & (4) \end{matrix}$

where c: X×X→[0, +∞) is the convex, lower semicontinuous transportation cost function with c(t, t)=0, and Π({circumflex over (P)}, P*) is the set of all joint distributions whose marginals are given by {circumflex over (P)} and P*. Intuitively, the Wasserstein distance can be interpreted as the minimum cost associated with transporting mass between probability measures. The Wasserstein distance can be chosen instead of others in order to understand how to transport from the empirical data distribution to an ideal synthetic data distribution in which the observations are independent of the exposure mechanism. Hence, the local minimax empirical risk minimization (ERM) problem can be considered:

$\begin{matrix} {{\underset{f_{\theta} \in \mathcal{F}}{\text{minimize}}\;\sup\limits_{{W_{c}\left( {P^{*},\hat{P}} \right)} < \rho}{{\mathbb{E}}_{\hat{P}}\left\lbrack {\delta\left( {Y,{f_{\theta}\left( {X,Z} \right)}} \right)} \right\rbrack}},} & (5) \end{matrix}$

which can directly account for the uncertainty induced by the lack of identifiability in the exposure mechanism, and can optimize f_(θ) under the worst possible setting. However, the formulation in (5) is, first of all, a constrained optimization problem. Secondly, the constraint is expressed in terms of the hypothetical P*. After applying a duality argument, the dual problem can be expressed via the exposure mechanism in the following Assertion 2. {circumflex over (Q)} is used to denote some estimation of Q_(o).
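The Wasserstein distance in (4) is generally hard to compute, but it has a simple closed form in one special case that can help build intuition. The sketch below assumes one-dimensional samples of equal size with uniform weights and the transportation cost c(s, t)=|s−t|, in which case the optimal coupling pairs the sorted samples. The function name is hypothetical, and this special case is offered only as an illustration, not as the computation used by the described embodiments.

```python
import numpy as np

def wasserstein_1d(sample_a, sample_b):
    """Wasserstein distance with cost c(s, t) = |s - t| between two
    equal-size empirical distributions with uniform weights.

    In one dimension the optimal coupling matches sorted samples,
    so the infimum over couplings reduces to a mean of absolute gaps.
    """
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal-size samples")
    return float(np.mean(np.abs(a - b)))

# Every point is moved by one unit, so the transport cost is 1.
shifted = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])

# The same values in a different order describe the same distribution.
identical = wasserstein_1d([2.0, 0.0, 1.0], [0.0, 1.0, 2.0])

print(shifted, identical)
```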

Assertion 2. Suppose that the transportation cost c is continuous and the propensity scores are all bounded away from zero, i.e., p(O_(u,i)=1|x_(u), z_(i))≥μ. Let 𝒫={{circumflex over (P)}: W_(c)(P*, {circumflex over (P)}) < ρ}, then

${\sup\limits_{\hat{P} \in \mathcal{P}}{{\mathbb{E}}_{\hat{P}}\left\lbrack {\delta\left( {Y,{f_{\theta}\left( {X,Z} \right)}} \right)} \right\rbrack}} = {\inf\limits_{\alpha \geq 0}\left\{ {{\alpha\rho} + {\sup\limits_{\hat{Q}}\left\{ {{{\mathbb{E}}_{P}\left\lbrack \frac{\delta\left( {Y,{f_{\theta}\left( {X,Z} \right)}} \right)}{\hat{q}\left( {O = 1 \mid X,Z} \right)} \right\rbrack} - {c_{0}\alpha\;{W_{c}\left( {{\hat{Q}}^{- 1},Q_{0}^{- 1}} \right)}}} \right\}}} \right\}},$

where c₀ is a positive constant and {circumflex over (q)} is the density function associated with {circumflex over (Q)}.

The proof of Assertion 2 is provided as follows, which first proves the dual formulation for the minimax ERM stated in Assertion 2, and then discusses the relaxation for the dual problem:

Proof. For the estimation {circumflex over (P)}=P/{circumflex over (Q)} of the ideal exposure-eliminated sample, W_(c)({circumflex over (P)}, P*)≤ρ is equivalent to W_(c)(P/{circumflex over (Q)}, P/Q_(o))≤ρ.

Observe that when P is given by the empirical distribution that assigns uniform weights to all samples, the Wasserstein distance W_(c)(P/{circumflex over (Q)}, P/Q_(o)) is convex in {circumflex over (Q)}⁻¹ (since c is convex), and {circumflex over (Q)}=Q_(o) gives W_(c)(P/{circumflex over (Q)}, P/Q_(o))=0.

Since the propensity scores are assumed to be all bounded away from zero, P/{circumflex over (Q)} and P/Q_(o) exist and are well behaved. The duality results can therefore be established, since Slater's condition holds. Let h=(x, z, y) ∈ X and X′ be a copy of X. Thus:

$\begin{matrix} {\begin{aligned} \sup\limits_{\hat{P}:\,W_{c}(\hat{P},P^{*}) \leq \rho}\int \delta\left( y,f_{\theta}(x,z) \right)d\hat{P}(h) &= \sup\limits_{\hat{Q}:\,W_{c}(P/\hat{Q},\,P/Q_{0}) \leq \rho}\int \frac{\delta\left( y,f_{\theta}(x,z) \right)}{\hat{q}\left( O = 1 \mid x,z \right)}d\hat{Q}(h) \\ &= \inf\limits_{\alpha \geq 0}\,\sup\limits_{\hat{Q}}\left\{ \int \frac{\delta\left( y,f_{\theta}(x,z) \right)}{\hat{q}\left( O = 1 \mid x,z \right)}d\hat{Q}(h) - \alpha\,W_{c}\left( \frac{P}{\hat{Q}},\frac{P}{Q_{0}} \right) + \alpha\rho \right\} \\ &= \inf\limits_{\alpha \geq 0}\,\sup\limits_{\hat{Q}}\left\{ \int \frac{\delta\left( y,f_{\theta}(x,z) \right)}{\hat{q}\left( O = 1 \mid x,z \right)}d\hat{Q}(h) - \alpha\inf\limits_{\gamma \in \Pi(P/\hat{Q},\,P/Q_{0})}\int c\left( h,h^{\prime} \right)d\gamma\left( h,h^{\prime} \right) + \alpha\rho \right\} \\ &= \inf\limits_{\alpha \geq 0}\,\sup\limits_{\hat{Q}}\,\sup\limits_{\gamma \in \Pi(P/\hat{Q},\,P/Q_{0})}\left\{ \int \left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)d\gamma\left( h,h^{\prime} \right) + \alpha\rho \right\}, \end{aligned}} & \left( {A{.2}} \right) \end{matrix}$

where in the last line the shorthand notation δ_(f) _(θ) (h):=δ(y, f_(θ)(x, z)) and {circumflex over (q)}(h):={circumflex over (q)}(O=1|x, z) are used. Then notice that

$\begin{matrix} {{\sup\limits_{\hat{Q}}\,\sup\limits_{\gamma \in \Pi(P/\hat{Q},\,P/Q_{0})}\int \left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)d\gamma\left( h,h^{\prime} \right)} \leq {\int \sup\limits_{h \in \mathcal{X}}\left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)dQ_{0}\left( h^{\prime} \right)}} & \left( {A{.3}} \right) \end{matrix}$

and it is then shown that the opposite direction also holds, so equality always holds. Let 𝒦 be the space of measurable conditional distributions (Markov kernels) from X to X′, then

$\begin{matrix} {{\sup\limits_{\hat{Q}}\,\sup\limits_{\gamma \in \Pi(P/\hat{Q},\,P/Q_{0})}\int \left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)d\gamma\left( h,h^{\prime} \right)} \geq {\sup\limits_{K \in \mathcal{K}}\int \left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)dK\left( h \mid h^{\prime} \right)dQ_{0}\left( h^{\prime} \right)}} & \left( {A{.4}} \right) \end{matrix}$

In the next step, consider the space of all measurable mappings h′ ↦ h(h′) from X′ to X, denoted by ℋ. Since all the mappings are measurable, the underlying spaces are regular, and δ_(f_(θ)) and c are at least semi-continuous, standard measure theory arguments for exchanging the integration and supremum yield

$\begin{matrix} {{\sup\limits_{h( \cdot ) \in \mathcal{H}}\int \left( \frac{\delta_{f_{\theta}}\left( h\left( h^{\prime} \right) \right)}{\hat{q}\left( h\left( h^{\prime} \right) \right)} - \alpha\,c\left( h\left( h^{\prime} \right),h^{\prime} \right) \right)dQ_{0}\left( h^{\prime} \right)} = {\int \sup\limits_{h \in \mathcal{X}}\left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)dQ_{0}\left( h^{\prime} \right)}} & \left( {A{.5}} \right) \end{matrix}$

where the h(·) on the LHS represents the mapping, and the h on the RHS still denotes elements from the sample space X. Now let the support of the conditional distribution K(h, h′) be given by h(h′). So according to (A.5):

$\begin{matrix} {\begin{aligned} \sup\limits_{K \in \mathcal{K}}\int \left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)dK\left( h \mid h^{\prime} \right)dQ_{0}\left( h^{\prime} \right) &= \sup\limits_{h( \cdot ) \in \mathcal{H}}\int \left( \frac{\delta_{f_{\theta}}\left( h\left( h^{\prime} \right) \right)}{\hat{q}\left( h\left( h^{\prime} \right) \right)} - \alpha\,c\left( h\left( h^{\prime} \right),h^{\prime} \right) \right)dQ_{0}\left( h^{\prime} \right) \\ &\geq \int \sup\limits_{h \in \mathcal{X}}\left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)dQ_{0}\left( h^{\prime} \right) \\ &\geq \sup\limits_{\hat{Q}}\,\sup\limits_{\gamma \in \Pi(P/\hat{Q},\,P/Q_{0})}\int \left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)d\gamma\left( h,h^{\prime} \right) \end{aligned}} & \left( {A{.6}} \right) \end{matrix}$

Combining (A.6), (A.4) and (A.3), see that

$\begin{matrix} {{\sup\limits_{\hat{Q}}\,\sup\limits_{\gamma \in \Pi(P/\hat{Q},\,P/Q_{0})}\int \left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)d\gamma\left( h,h^{\prime} \right)} \geq {\int \sup\limits_{h \in \mathcal{X}}\left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)dQ_{0}\left( h^{\prime} \right)}} & \left( {A{.7}} \right) \end{matrix}$

Finally, notice that

${\sup\limits_{\hat{Q}}\,\sup\limits_{\gamma \in \Pi(P/\hat{Q},\,P/Q_{0})}\int \left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)d\gamma\left( h,h^{\prime} \right)} \geq {\sup\limits_{\hat{Q}}\left\{ \int \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)}dQ_{0}(h) - \alpha\,W_{c}\left( P/\hat{Q},P/Q_{0} \right) \right\}}$

so according to (A.2), reach the final result:

$\begin{matrix} {\begin{aligned} \sup\limits_{\hat{P}:\,W_{c}(\hat{P},P^{*}) \leq \rho}\int \delta\left( y,f_{\theta}(x,z) \right)d\hat{P}(h) &= \inf\limits_{\alpha \geq 0}\left\{ \alpha\rho + \int \sup\limits_{h \in \mathcal{X}}\left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \alpha\,c\left( h,h^{\prime} \right) \right)dQ_{0}\left( h^{\prime} \right) \right\} \\ &= \inf\limits_{\alpha \geq 0}\left\{ \alpha\rho + \sup\limits_{\hat{Q}}\left\{ \int \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)}dQ_{0}(h) - \alpha\,W_{c}\left( P/\hat{Q},P/Q_{0} \right) \right\} \right\} \end{aligned}} & \left( {A{.8}} \right) \end{matrix}$

To reach the relaxation given in (5), use the alternate expression for the Wasserstein distance obtained from the Kantorovich-Rubinstein duality. Denote the Lipschitz continuity for a function f by ∥f∥_(L)≤l. When the cost function c is l-Lipschitz continuous, W_(c)(P₁, P₂) is also referred to as the Wasserstein-l distance. Without loss of generality, consider ∥c∥_(L)≤1, such as the ℓ₂ norm, and with that the Wasserstein distance is equivalent to:

$\begin{matrix} {{W_{c}\left( {{P/\hat{Q}},{P/Q_{0}}} \right)} = {\sup\limits_{{\| f \|}_{L} \leq 1}\left\{ {{{\mathbb{E}}_{h \sim P/\hat{Q}}{f(h)}} - {{\mathbb{E}}_{h \sim P/Q_{0}}{f(h)}}} \right\}}} & \left( {A{.9}} \right) \end{matrix}$

where f: X→ℝ. In practice, when P is the empirical distribution that assigns uniform weights to all the samples:

$\begin{matrix} {\begin{aligned} W_{c}\left( P/\hat{Q},P/Q_{0} \right) &= \sup\limits_{{\| f \|}_{L} \leq 1}\left\{ {\mathbb{E}}_{h \sim P_{n}/\hat{Q}}f(h) - {\mathbb{E}}_{h \sim P_{n}/Q_{0}}f(h) \right\} \\ &= \sup\limits_{{\| f \|}_{L} \leq 1}\left\{ a_{1}{\mathbb{E}}_{h \sim P_{n}}\frac{f(h)}{\hat{q}(h)} - a_{2}{\mathbb{E}}_{h \sim P_{n}}\frac{f(h)}{q_{0}(h)} \right\} \\ &= \sup\limits_{{\| f \|}_{L} \leq 1}{\mathbb{E}}_{h \sim P_{n}}\left\lbrack \frac{f(h)}{\hat{q}(h) \cdot q_{0}(h)}\left( a_{1}q_{0}(h) - a_{2}\hat{q}(h) \right) \right\rbrack \\ &\leq \sup\limits_{h \in \mathcal{X}}\left\{ \frac{1}{\hat{q}(h) \cdot q_{0}(h)} \right\} \cdot \sup\limits_{{\| f \|}_{L} \leq 1}\left\{ a_{3}{\mathbb{E}}_{h \sim P_{n} \cdot Q_{0}}f(h) - a_{4}{\mathbb{E}}_{h \sim P_{n} \cdot \hat{Q}}f(h) \right\} \\ &\leq \frac{1}{\mu^{2}}\sup\limits_{{\| f \|}_{L} \leq \max\{ a_{5},a_{6}\}}\left\{ {\mathbb{E}}_{h \sim Q_{0}}f(h) - {\mathbb{E}}_{h \sim \hat{Q}}f(h) \right\} \\ &\leq \frac{1}{\mu^{2}}W_{\tilde{c}}\left( \hat{Q},Q_{0} \right) \end{aligned}} & \left( {A{.10}} \right) \end{matrix}$

where the above a_(i) are all constants induced by using the change-of-measure with importance-weighting estimators, and the induced cost function {tilde over (c)} on the last line satisfies ∥{tilde over (c)}∥_(L)≤max{a₅, a₆}. Therefore, the Wasserstein distance between P_(n)/{circumflex over (Q)} and P_(n)/Q₀ can be bounded by W_({tilde over (c)})({circumflex over (Q)}, Q₀). Hence, for each α≥0 in (A.8),

${\sup\limits_{\hat{Q}}{{\mathbb{E}}_{P}\left\lbrack \frac{\delta\left( {Y,{f_{\theta}\left( {X,Z} \right)}} \right)}{{\hat{q}\left( {O = 1 \mid X,Z} \right)} \cdot {q_{0}(h)}} \right\rbrack}} - {\tilde{\alpha}\,{W_{\tilde{c}}\left( {\hat{Q},Q_{0}} \right)}},\quad{\tilde{\alpha} \geq 0},$

is a relaxation of the result in Assertion 2. In practice, the specific forms of the cost functions c or {tilde over (c)} do not matter, because the Wasserstein distance is intractable and the data-dependent surrogates discussed below in connection with practical implementations can be used.

Considering the relaxation for each fixed ∝ (see the appendix), the minimax objective has a desirable formulation where ∝ becomes a tuning parameter:

$\begin{matrix} {{{\underset{f_{\theta} \in F}{minimize}\begin{matrix} \sup \\ \hat{Q} \end{matrix}{{\mathbb{E}}_{p}\left\lbrack \frac{\delta\left( {Y,{f_{\theta}\left( {X,Z} \right)}} \right)}{\hat{q}\left( {{O = \left. 1 \middle| X \right.},Z} \right)} \right\rbrack}} - {\alpha\;{W_{c}\left( {\hat{Q},Q_{0}} \right)}\mspace{14mu}\alpha}} \geq 0.} & (6) \end{matrix}$

To make sense of (6), see that while {circumflex over (Q)} is acting adversarially against f_(θ) as the inverse weights in the first term, it cannot arbitrarily increase the objective function, since the second terms acts as a regularizer that keeps {circumflex over (Q)} close to the true exposure mechanism Q₀. Compared with the primal problem in (5), the relaxed dual formulation in (6) gives the desired unconstrained optimization problem. Also, note that the exposure mechanism is often given by the recommender system that was operating during the data collection, which can be leveraged as a domain knowledge to further convert (6) to a more tractable formulation. Let g* be the recommendation model that underlies Q₀. Assume for now that P_(g) (0=1|X,Z) is given by G(g(X,Z)) ∈ (μ,1), μ>0 for some transformation function G. The inclusion and manipulation of the unobserved factors is discussed below in connection with practical implementations. The objective in (6) can then be converted to a two-model adversarial game:

$\begin{matrix} {\underset{f_{\theta} \in \mathcal{F}}{\text{minimize}}\;\sup\limits_{g_{\psi} \in \mathcal{G}}\;{\mathbb{E}}_{P}\left\lbrack \frac{\delta\left( Y,f_{\theta}(X,Z) \right)}{G\left( g_{\psi}(X,Z) \right)} \right\rbrack - \alpha\, W_{c}\left( G(g_{\psi}),G(g^{*}) \right),\quad\alpha \geq 0.} & (7) \end{matrix}$

Before discussing the implications of (7), its practical implementations, and the minimax optimization, the theoretical guarantees for the generalization error are shown and discussed, in comparison to the standard ERM setting, after introducing the adversarial component.

Theoretical Property

Before stating results, the loss function corresponding to the adversarial objective can be characterized, as well as the complexity of the hypothesis space. For the first purpose, the cost-regulated loss is introduced, which is defined as:

$\Delta_{\gamma}\left( f_{\theta};(x,z,y) \right) = \sup\limits_{(x^{\prime},z^{\prime},y^{\prime}) \in \mathcal{X}}\left\{ \frac{\delta\left( y^{\prime},f_{\theta}(x^{\prime},z^{\prime}) \right)}{q\left( O = 1 \middle| x^{\prime},z^{\prime} \right)} - \gamma\, c\left( (x,z,y),(x^{\prime},z^{\prime},y^{\prime}) \right) \right\}.$

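For the second purpose, consider the entropy integral $\mathcal{J}(\tilde{\mathcal{F}}) = \int_{0}^{\infty}\sqrt{\log\mathcal{N}(\epsilon;\tilde{\mathcal{F}},\|\cdot\|_{\infty})}\, d\epsilon$, where $\tilde{\mathcal{F}} = \{\delta(f_{\theta},\cdot) \mid f_{\theta} \in \mathcal{F}\}$ is the hypothesis class and $\mathcal{N}(\epsilon;\tilde{\mathcal{F}},\|\cdot\|_{\infty})$ gives the covering number for the ϵ-cover of $\tilde{\mathcal{F}}$ in terms of the ∥·∥∞ norm. Suppose that |δ(y, f_(θ)(x, z))|≤M holds uniformly. The theoretical result on the worst-case generalization bound under the minimax setting is stated next.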

Theorem 1. Suppose the mapping G from g_(ψ) to q(o=1|x, z) is one-to-one and surjective with $g_{\psi} \in \mathcal{G}$. Let $\tilde{\mathcal{G}}(\rho) = \{ g \in \mathcal{G} \mid W_{c}(G(g),G(g^{*})) \leq \rho \}$. Then under the conditions specified in Assertion 2, for all γ≥0 and ρ>0, the following inequality holds with probability at least 1−ϵ:

$\sup\limits_{g_{\psi} \in \tilde{\mathcal{G}}(\rho)}{\mathbb{E}}_{P}\left\lbrack \frac{\delta\left( Y,f_{\theta}(X,Z) \right)}{G\left( g_{\psi}(X,Z) \right)} \right\rbrack \leq c_{1}\gamma\rho + {\mathbb{E}}_{P_{n}}\left\lbrack \Delta_{\gamma}\left( f_{\theta};(X,Z,Y) \right) \right\rbrack + \frac{24\mathcal{J}(\tilde{\mathcal{F}}) + c_{2}\left( M,\sqrt{\log\frac{2}{\epsilon}},\gamma \right)}{\sqrt{n}},$

where c₁ is a positive constant and c₂ is a simple linear function with positive weights.

The proof of Theorem 1 is provided as follows:

Proof Following the same arguments from the proof in Assertion 2, a result similar to that stated in (A.8) is

$\begin{matrix} {\begin{aligned} \sup\limits_{g_{\psi} \in \tilde{\mathcal{G}}(\rho)}{\mathbb{E}}_{P}\left\lbrack \frac{\delta\left( Y,f_{\theta}(X,Z) \right)}{G\left( g_{\psi}(X,Z) \right)} \right\rbrack &\leq \inf\limits_{\gamma \geq 0}\left\{ \gamma\rho + \int\sup\limits_{h \in \mathcal{X}}\left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \gamma\, c(h,h^{\prime}) \right)dP(h^{\prime}) \right\} \\ &= \inf\limits_{\gamma \geq 0}\left\{ \gamma\rho + {\mathbb{E}}_{P}\left\lbrack \Delta_{\gamma}(f_{\theta};H) \right\rbrack \right\}\quad\text{(by the definition of }\Delta_{\gamma}\text{)} \\ &\leq \inf\limits_{\gamma \geq 0}\left\{ \gamma\rho + {\mathbb{E}}_{P_{n}}\left\lbrack \Delta_{\gamma}(f_{\theta};H) \right\rbrack + \sup\limits_{f_{\theta} \in \mathcal{F}}\left( {\mathbb{E}}_{P}\left\lbrack \Delta_{\gamma}(f_{\theta};H) \right\rbrack - {\mathbb{E}}_{P_{n}}\left\lbrack \Delta_{\gamma}(f_{\theta};H) \right\rbrack \right) \right\}. \end{aligned}} & (A.11) \end{matrix}$

Let $W_{\gamma} = \sup\limits_{f_{\theta} \in \mathcal{F}}\left( {\mathbb{E}}_{P}\left\lbrack \Delta_{\gamma}(f_{\theta};H) \right\rbrack - {\mathbb{E}}_{P_{n}}\left\lbrack \Delta_{\gamma}(f_{\theta};H) \right\rbrack \right)$,

then notice that

$W_{\gamma} = \frac{1}{n}\sup\limits_{f_{\theta} \in \mathcal{F}}\left\lbrack \sum\limits_{i = 1}^{N}\left( {\mathbb{E}}_{P}\left\lbrack \Delta_{\gamma}(f_{\theta};H) \right\rbrack - \Delta_{\gamma}(f_{\theta};H_{i}) \right) \right\rbrack,\quad\gamma \geq 0.$

Since |δ_(f_θ)(h)|≤μM holds uniformly, according to McDiarmid's inequality on bounded random variables:

$\begin{matrix} {{p\left( {{W_{\gamma} - {{\mathbb{E}}\; W_{\gamma}}} \geq {\mu\; M\sqrt{\frac{\log\;{1/\epsilon}}{2n}}}} \right)} \leq {\epsilon.}} & \left( {A{.12}} \right) \end{matrix}$

Then let ϵ₁, . . . , ϵ_(N) be the i.i.d Rademacher random variables independent of H, and H′_(i) be the i.i.d copy of H_(i) for i=1, . . . , N.

Applying the symmetrization argument, see that

$\begin{matrix} {\begin{aligned} {\mathbb{E}}\, W_{\gamma} &= {\mathbb{E}}\left\lbrack \sup\limits_{f_{\theta} \in \mathcal{F}}\left| \frac{1}{N}\sum\limits_{i = 1}^{N}\Delta_{\gamma}(f_{\theta};H_{i}^{\prime}) - \frac{1}{N}\sum\limits_{i = 1}^{N}\Delta_{\gamma}(f_{\theta};H_{i}) \right| \right\rbrack \\ &= {\mathbb{E}}\left\lbrack \sup\limits_{f_{\theta} \in \mathcal{F}}\left| \frac{1}{N}\sum\limits_{i = 1}^{N}\epsilon_{i}\Delta_{\gamma}(f_{\theta};H_{i}^{\prime}) - \frac{1}{N}\sum\limits_{i = 1}^{N}\epsilon_{i}\Delta_{\gamma}(f_{\theta};H_{i}) \right| \right\rbrack \\ &\leq 2{\mathbb{E}}\left\lbrack \sup\limits_{f_{\theta} \in \mathcal{F}}\left| \frac{1}{N}\sum\limits_{i = 1}^{N}\epsilon_{i}\Delta_{\gamma}(f_{\theta};H_{i}) \right| \right\rbrack. \end{aligned}} & (A.13) \end{matrix}$

It is clear that each ϵ_(i)Δ_(γ)(f_(θ); H_(i)) is zero-mean, and it is now shown that it is sub-Gaussian as well.

For any two f_(θ), f′_(θ), the bounded difference is shown:

$\begin{matrix} {\begin{aligned} &{\mathbb{E}}\left\lbrack \exp\left( \lambda\left( \frac{1}{\sqrt{N}}\sum\limits_{i = 1}^{N}\epsilon_{i}\Delta_{\gamma}(f_{\theta};H_{i}) - \frac{1}{\sqrt{N}}\sum\limits_{i = 1}^{N}\epsilon_{i}\Delta_{\gamma}(f_{\theta}^{\prime};H_{i}) \right) \right) \right\rbrack \\ &\quad = \left( {\mathbb{E}}\left\lbrack \exp\left( \frac{\lambda}{\sqrt{N}}\epsilon_{1}\left( \Delta_{\gamma}(f_{\theta};H_{1}) - \Delta_{\gamma}(f_{\theta}^{\prime};H_{1}) \right) \right) \right\rbrack \right)^{N} \\ &\quad = \left( {\mathbb{E}}\left\lbrack \exp\left( \frac{\lambda}{\sqrt{N}}\epsilon_{1}\sup\limits_{h^{\prime}}\inf\limits_{h^{\prime\prime}}\left\{ \frac{\delta_{f_{\theta}}(h^{\prime})}{q(h^{\prime})} - \gamma\, c(H_{1},h^{\prime}) - \frac{\delta_{f_{\theta}^{\prime}}(h^{\prime\prime})}{q(h^{\prime\prime})} + \gamma\, c(H_{1},h^{\prime\prime}) \right\} \right) \right\rbrack \right)^{N} \\ &\quad \leq \left( {\mathbb{E}}\left\lbrack \exp\left( \frac{\lambda}{\sqrt{N}}\epsilon_{1}\sup\limits_{h^{\prime}}\left\{ \frac{\delta_{f_{\theta}}(h^{\prime})}{q(h^{\prime})} - \frac{\delta_{f_{\theta}^{\prime}}(h^{\prime})}{q(h^{\prime})} \right\} \right) \right\rbrack \right)^{N} \\ &\quad \leq \exp\left( \lambda^{2}\left\| \frac{\delta_{f_{\theta}}}{q} - \frac{\delta_{f_{\theta}^{\prime}}}{q} \right\|_{\infty}^{2}/2 \right)\quad\text{(by Hoeffding's inequality)}. \end{aligned}} & (A.14) \end{matrix}$

Hence, see that

$\frac{1}{\sqrt{N}}\epsilon_{i}{\Delta_{\gamma}\left( {f_{\theta};H_{i}} \right)}$

is sub-Gaussian with respect to

$\left\| \frac{\delta_{f_{\theta}}}{q} - \frac{\delta_{f_{\theta}^{\prime}}}{q} \right\|_{\infty}^{2}.$

Therefore, 𝔼W_(γ) can be bounded by using the standard technique for Rademacher complexity and Dudley's entropy integral:

$\begin{matrix} {{\mathbb{E}}\, W_{\gamma} \leq \frac{24}{\sqrt{N}}\mathcal{J}(\tilde{\mathcal{F}}).} & (A.15) \end{matrix}$

Combining all the above bounds in (A.11), (A.12) and (A.15) obtains the desired result.

The generalization bound in Theorem 1 holds for all ρ and γ. When these are set by certain data-dependent quantities, the result can be converted to simplified forms that reveal more direct connections with the propensity-weighted loss and standard ERM results:

Corollary 1. Following the statements in Theorem 1, there exist some data-dependent γ_(n) and ρ_(n)(f_(θ)), such that when γ≥γ_(n), for all ρ>0:

$\Pr\left( \sup\limits_{g_{\varphi} \in \tilde{\mathcal{G}}(\rho)}{\mathbb{E}}_{P}\left\lbrack \frac{\delta\left( Y,f_{\theta}(X,Z) \right)}{G\left( g_{\varphi}(X,Z) \right)} \right\rbrack \leq c_{1}\gamma\rho + {\mathbb{E}}_{P_{n}}\left\lbrack \frac{\delta\left( f_{\theta};(X,Z,Y) \right)}{q\left( O = 1 \middle| X,Z \right)} \right\rbrack + \varepsilon_{n}(\epsilon) \right) > 1 - \epsilon;$

and when ρ=ρ_(n)(f_(θ)), for all γ≥0:

$\Pr\left( \sup\limits_{g_{\varphi} \in \tilde{\mathcal{G}}(\rho)}{\mathbb{E}}_{P}\left\lbrack \frac{\delta\left( Y,f_{\theta}(X,Z) \right)}{G\left( g_{\varphi}(X,Z) \right)} \right\rbrack \leq \sup\limits_{P:W_{c}(P,P_{n}) \leq \tilde{\rho}}{\mathbb{E}}_{P}\left\lbrack \frac{\delta\left( f_{\theta};(X,Z,Y) \right)}{q\left( O = 1 \middle| X,Z \right)} \right\rbrack + \varepsilon_{n}(\epsilon) \right) > 1 - \epsilon,\quad\text{where}\quad\varepsilon_{n}(\epsilon) = \left( 24\mathcal{J}(\tilde{\mathcal{F}}) + c_{2}\left( M,\sqrt{\log\frac{2}{\epsilon}},\gamma \right) \right)/\sqrt{n}$

as suggested by Theorem 1.

The proof of Corollary 1 is provided as follows:

Proof To obtain the first result, let the data-dependent γ_(n) be given by

$\gamma_{n} = {\begin{matrix} \max \\ i \end{matrix}\begin{matrix} \sup \\ {h^{\prime} \in \mathcal{H}} \end{matrix}{\left( {\frac{\delta_{f_{\theta}}\left( h^{\prime} \right)}{q\left( h^{\prime} \right)} - \frac{\delta_{f_{\theta}}\left( h_{i} \right)}{q\left( h_{i} \right)}} \right)/{{c\left( {h_{i},h^{\prime}} \right)}.}}}$

Then according to the definition of Δ_(γ):

${{\mathbb{E}}_{P_{n}}{\Delta_{\gamma_{n}}\left( {f_{\theta};H} \right)}} = {\frac{1}{N}{\sum\limits_{i}{\begin{matrix} \sup \\ {h^{\prime} \in \mathcal{X}} \end{matrix}{\left\{ {\frac{\delta_{f_{\theta}}\left( h^{\prime} \right)}{q\left( h^{\prime} \right)} - {\max\limits_{j}{\sup\limits_{h^{\prime\prime} \in \mathcal{X}}\left\{ \frac{\frac{\delta_{f_{\theta}}\left( h^{\prime\prime} \right)}{q\left( h^{\prime\prime} \right)} - \frac{\delta_{f_{\theta}}\left( h_{j} \right)}{q\left( h_{j} \right)}}{c\left( {h_{j},h^{\prime\prime}} \right)} \right\}{c\left( {h_{i},h^{\prime}} \right)}}}} \right\}.}}}}$

It is straightforward to verify that

${\mathbb{E}}_{P_{n}}\Delta_{\gamma_{n}}(f_{\theta};H) \leq \frac{1}{N}\sum\limits_{i}\left( \sup\limits_{h^{\prime} \in \mathcal{X}}\left\{ \frac{\delta_{f_{\theta}}(h^{\prime})}{q(h^{\prime})} \right\} + \frac{\delta_{f_{\theta}}(h_{i})}{q(h_{i})} - \sup\limits_{h^{\prime\prime} \in \mathcal{X}}\left\{ \frac{\delta_{f_{\theta}}(h^{\prime\prime})}{q(h^{\prime\prime})} \right\} \right) = \frac{1}{N}\sum\limits_{i}\frac{\delta_{f_{\theta}}(h_{i})}{q(h_{i})},$

as well as

${{\mathbb{E}}_{P_{n}}{\Delta_{\gamma_{n}}\left( {f_{\theta};H} \right)}} \geq {{\frac{1}{N}{\sum\limits_{i}{\sup\limits_{h^{\prime} \in \chi}\left\{ \frac{\delta_{f_{\theta}}\left( h^{\prime} \right)}{q\left( h^{\prime} \right)} \right\}}}} - {\max\limits_{j}{\sup\limits_{h^{\prime\prime} \in \chi}\left\{ {\frac{\frac{\delta_{f_{\theta}}\left( h^{\prime\prime} \right)}{q\left( h^{\prime\prime} \right)} - \frac{\delta_{f_{\theta}}\left( h_{j} \right)}{q\left( h_{j} \right)}}{c\left( {h_{j},h^{\prime\prime}} \right)}{c\left( {h_{i},h_{j}} \right)}} \right\}}}}$

which also equals

$\frac{1}{N}\Sigma_{i}\sup\limits_{h^{\prime\prime} \in \chi}{\left\{ \frac{\delta_{f_{\theta}}\left( h_{i} \right)}{q\left( h_{i} \right)} \right\}.}$

Therefore, when γ=γ_(n), then

${{\mathbb{E}}_{P_{n}}\left\lbrack {\Delta_{\gamma_{n}}\left( {f_{\theta};H} \right)} \right\rbrack} = {{{\mathbb{E}}_{P_{n}}\left\lbrack \frac{\delta_{f_{\theta}}\left( H_{i} \right)}{q\left( H_{i} \right)} \right\rbrack}.}$

Similarly, it can be shown that when γ>γ_(n), the above equality also holds. Hence, replace 𝔼_(P_n)[Δ_(γ_n)(f_(θ); H)] with

${\mathbb{E}}_{P_{n}}\left\lbrack \frac{\delta_{f_{\theta}}\left( H_{i} \right)}{q\left( H_{i} \right)} \right\rbrack$

in Theorem 1 and obtain the first result.
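The first result can be checked numerically on a toy case. The following is a minimal sketch, not the implementation of the described approach, and it rests on several assumptions that do not come from the text above: the supremum in Δ_(γ) is restricted to the observed sample points, the cost c is the absolute difference between scalar points, and the names `cost_regulated_loss` and `gamma_n` are hypothetical. Under those assumptions, the cost-regulated loss collapses to the propensity-weighted loss δ(h_i)/q(h_i) once γ reaches γ_(n), and it equals the largest ratio when γ=0.

```python
def cost_regulated_loss(ratios, points, i, gamma):
    """Delta_gamma at sample i, with the sup restricted to the listed points.

    ratios[j] plays the role of delta(h_j) / q(h_j); the transport cost is
    assumed to be |h_i - h_j| purely for illustration.
    """
    return max(
        r_j - gamma * abs(points[i] - h_j)
        for r_j, h_j in zip(ratios, points)
    )


def gamma_n(ratios, points):
    """Data-dependent threshold: largest gain in ratio per unit of cost."""
    return max(
        (ratios[j] - ratios[i]) / abs(points[i] - points[j])
        for i in range(len(points))
        for j in range(len(points))
        if i != j
    )


points = [0.0, 1.0, 3.0]  # toy samples h_i
ratios = [1.0, 2.0, 1.5]  # toy values of delta(h_i) / q(h_i)

g_n = gamma_n(ratios, points)
losses_at_gn = [cost_regulated_loss(ratios, points, i, g_n) for i in range(3)]
losses_at_zero = [cost_regulated_loss(ratios, points, i, 0.0) for i in range(3)]
```

With γ set to the threshold, moving mass to any other point costs at least as much as it gains, so each sample keeps its own ratio.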

To obtain the second result, define the transportation map:

$T_{\gamma}(f_{\theta};h) = \arg\max\limits_{h^{\prime} \in \mathcal{X}}\left\{ \frac{\delta_{f_{\theta}}(h^{\prime})}{q(h^{\prime})} - \gamma\, c(h,h^{\prime}) \right\}.$

Then according to (A.8), the empirical maximizer for $\sup_{\hat{P}:W_{c}(\hat{P},P^{*}) \leq \rho}\int\delta(y,f_{\theta}(x,z))\, d\hat{P}(h)$ is attained by

${\overset{\hat{}}{P}\left( f_{\theta} \right)} = {\frac{1}{N}{\sum\limits_{i = 1}^{N}I_{T_{\gamma}{({f_{\theta};h_{i}})}}}}$

where I_(h) assigns a point mass at h, since it maximizes

$\int\sup\limits_{h \in \mathcal{X}}\left( \frac{\delta_{f_{\theta}}(h)}{\hat{q}(h)} - \gamma\, c(h,h^{\prime}) \right)dQ_{0}(h^{\prime}).$

Then let ρ_(n)(f_(θ))=W_(c)({circumflex over (P)}(f_(θ)), P_(n)), which equals 𝔼_(P_n)[c(T_(γ)(f_(θ); H), H)] by definition. Now,

$c_{1}\gamma\rho_{n}(f_{\theta}) + {\mathbb{E}}_{P_{n}}\left\lbrack \Delta_{\gamma}(f_{\theta};H) \right\rbrack = \sup\limits_{P:W_{c}(P,P_{n}) \leq \tilde{\rho}}{\mathbb{E}}_{P}\left\lbrack \delta(f_{\theta};H)/q(H) \right\rbrack,$

for some {tilde over (ρ)} that absorbs the excess constant terms, which can be plugged into Theorem 1 to obtain the second result for Corollary 1.

Corollary 1 shows that the approach described herein has the same 1/√{square root over (n)} rate as the standard ERM. Also, the first result reveals an extra γρ bias term induced by the adversarial setting, and the second result characterizes how the additional uncertainty is reflected in the propensity-weighted empirical loss.

Practical Implementation

Directly optimizing the minimax objective in (7) can be infeasible because g* is unknown and the Wasserstein distance is hard to compute when 𝒢 is a complicated model class such as a neural network. Nevertheless, understanding the comparative roles of f_(θ) and g_(φ) can help in constructing practical solutions.

Recall that the goal is to optimize f_(θ). The auxiliary g_(φ) is introduced to characterize the adversarial exposure mechanism, so there is less interest in recovering the true g*. That said, the term W_(c)(G(g_(φ)), G(g*)) serves to regularize g_(φ) such that it is constrained by the underlying exposure mechanism. Relaxing or tightening the regularization term should not significantly impact the solution because the regularization parameter α can be adjusted. Hence, tractable regularizers can be designed to approximate or even replace W_(c)(G(g_(φ)), G(g*)), as long as the constraint on g_(φ) is established under the same principle. Similar ideas have also been applied to train the generative adversarial network (GAN): the optimal classifier depends on the unknown data distribution, so in practice, alternative tractable classifiers that fit the problem are used. Several alternative regularizers for g_(φ) are listed below.

-   -   In the explicit feedback data setting, the exposure status is         partially observed, so the loss of G(g_(φ)) on the         partially-observed exposure data can be used as the regularizer,         i.e.,

${\frac{1}{D_{\exp}}{\sum\limits_{{({u,i})} \in \mathcal{D}_{\exp}}{\varnothing\left( {g_{\varphi}\left( {x_{u},z_{i}} \right)} \right)}}},{{{where}\mspace{14mu}\mathcal{D}_{\exp}} = {\left\{ {\left. {\left( {u,i} \right) \in \mathcal{D}} \middle| o_{u,i} \right. = 1} \right\}.}}$

-   -   For the content-based recommendations, the exposure often can         have high correlation with popularity where the popular items         are more likely to be recommended. So the regularizer may         leverage the empirical popularity via:

${corr}{\left( {{\frac{1}{m}{\sum\limits_{u}{G\left( {g_{\varphi}\left( {X_{u},Z_{i}} \right)} \right)}}},{\frac{1}{m}{\sum\limits_{u}Y_{u,i}}}} \right).}$

-   -   In the implicit feedback setting, if all the other choices are impractical, simply use the loss on the feedback data as a regularizer: 𝔼_(P_n)[∅(Y·g_(φ)(X, Z))]. The loss-based regularizer is meaningful because g* is often determined by some other recommendation models. If it happens that g* ∈ 𝒢, similar performances can be expected from g_(φ) and g* on the same feedback data since the exposure mechanism is determined by g* itself.
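The popularity-based regularizer in the second item above can be sketched as follows. This is an illustrative sketch only: the nested-list layout `exposure[u][i]` and `feedback[u][i]`, the use of the Pearson correlation for corr(·,·), and the names `pearson` and `popularity_regularizer` are assumptions, not details taken from the text. The per-item averages over the m users are computed first, and the correlation is then taken across items.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (nonzero variance assumed)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def popularity_regularizer(exposure, feedback):
    """Correlate per-item mean exposure probability with per-item mean feedback.

    exposure[u][i] stands for G(g(X_u, Z_i)); feedback[u][i] stands for Y_{u,i}.
    """
    m = len(exposure)
    n_items = len(exposure[0])
    mean_exposure = [sum(exposure[u][i] for u in range(m)) / m for i in range(n_items)]
    popularity = [sum(feedback[u][i] for u in range(m)) / m for i in range(n_items)]
    return pearson(mean_exposure, popularity)


# Two users, three items: exposure tracks popularity exactly in this toy case.
toy_exposure = [[0.9, 0.5, 0.1], [0.7, 0.5, 0.3]]
toy_feedback = [[1, 1, 0], [1, 0, 0]]
toy_corr = popularity_regularizer(toy_exposure, toy_feedback)
```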

The third example is focused on because it applies to almost all cases without requiring excessive assumptions. Therefore, the practical adversarial objective is now given by:

$\begin{matrix} {\underset{f_{\theta} \in \mathcal{F}}{\text{minimize}}\;\sup\limits_{g_{\varphi} \in \mathcal{G}}\;{\mathbb{E}}_{P_{n}}\left\lbrack \frac{\delta\left( Y,f_{\theta}(X,Z) \right)}{G\left( g_{\varphi}(X,Z) \right)} \right\rbrack - \alpha\,{\mathbb{E}}_{P_{n}}\left\lbrack \delta\left( Y,g_{\varphi}(X,Z) \right) \right\rbrack,\quad\alpha \geq 0.} & (8) \end{matrix}$

The next step is to consider how to handle the unobserved factors that also play a part in the exposure mechanism. As mentioned above, having unobserved factors is inevitable in practice. In particular, the Tukey's factorization used in the missing data approaches can be leveraged. In the presence of unobserved factors, Tukey's factorization suggests additionally characterizing the relationship between exposure mechanism and outcome.

For clarity, a simple logistic-regression to model G can be employed as:

G _(β)(g _(φ)(x, z), y)=σ(β₀+β₁g_(φ)(x, z)+β₂ y),

where σ(·) is the sigmoid function. The final form of the adversarial game can be expressed as follows:

$\begin{matrix} {{{\underset{{f_{\theta} \in \mathcal{F}},\beta}{minimize}\sup\limits_{g_{\varphi} \in \mathcal{G}}{{\mathbb{E}}_{P_{n}}\left\lbrack \frac{\delta\left( {Y,{f_{\theta}\left( {X,Z} \right)}} \right)}{G_{\beta}\left( {{g_{\varphi}\left( {X,Z} \right)},Y} \right)} \right\rbrack}} - {{\alpha\mathbb{E}}_{P_{n}}\left\lbrack {\delta\left( {Y,{g_{\varphi}\left( {X,Z} \right)}} \right)} \right\rbrack}},{\alpha \geq 0.}} & (9) \end{matrix}$

β can be placed in the minimization problem for the following reason. By design, G_(β) merely characterizes the potential impact of unobserved factors, which is not considered to act adversarially. Otherwise, the adversarial model can be too strong for f_(θ) to learn anything useful.
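To make the logistic exposure model and the objective in (9) concrete, the following minimal sketch evaluates the empirical objective for fixed models. It is an illustration under assumptions rather than the described implementation: the per-sample losses of f_(θ) and g_(φ) and the scores of g_(φ) are passed in as precomputed lists, and the names `g_beta` and `adversarial_objective` are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def g_beta(score, y, beta):
    """Logistic exposure model: sigma(beta0 + beta1 * g(x, z) + beta2 * y)."""
    b0, b1, b2 = beta
    return sigmoid(b0 + b1 * score + b2 * y)


def adversarial_objective(losses_f, losses_g, scores_g, labels, beta, alpha):
    """Empirical value of (9): weighted loss of f minus alpha times the loss of g.

    losses_f[k] stands for delta(Y_k, f(X_k, Z_k)), losses_g[k] for
    delta(Y_k, g(X_k, Z_k)), and scores_g[k] for g(X_k, Z_k).
    """
    n = len(labels)
    weighted = sum(
        loss / g_beta(score, y, beta)
        for loss, score, y in zip(losses_f, scores_g, labels)
    ) / n
    regularizer = sum(losses_g) / n
    return weighted - alpha * regularizer


# With beta = (0, 0, 0) every propensity is 0.5, so the weighted loss doubles.
value = adversarial_objective(
    losses_f=[0.2, 0.4],
    losses_g=[0.1, 0.3],
    scores_g=[1.5, -0.7],
    labels=[1, 0],
    beta=(0.0, 0.0, 0.0),
    alpha=0.5,
)
```

In this sketch, f_(θ) and β would be updated to decrease `value`, while g_(φ) would be updated to increase it.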

Tukey's factorization can have implications on unobserved factors for exposure. Tukey's factorization can be used by the G_(β) model to handle the unobserved factors in recommender system.

The following notation is used for the counterfactual outcome: Y_(u,i)(o), o ∈ {0,1}, which represents what the user feedback would be if the exposure O_(u,i) were given by o ∈ {0,1}. In the factual world, observation can be limited to Y_(u,i) for either O_(u,i)=1 or O_(u,i)=0, and the tuple (Y_(u,i)(1), Y_(u,i)(0)) is not jointly observed.

In the absence of unobserved factors, the joint distribution of (Y_(u,i)(1), Y_(u,i)(0)) has a straightforward formulation and can be estimated effectively from data using tools from causal inference. However, when unobserved factors exist, there is confounding between (Y_(u,i)(1), Y_(u,i)(0)), which violates a fundamental assumption of many causal inference solutions.

The Tukey's factorization, on the other hand, characterizes the missing data distribution regardless of the unobserved factors as:

$\begin{matrix} {p_{\beta}\left( Y(o),O \middle| X,Z \right) = p\left( Y(o) \middle| O = o,X,Z \right)p\left( O = o \middle| X,Z \right) \cdot \frac{p_{\beta}\left( O \middle| Y(o),X,Z \right)}{p_{\beta}\left( O = o \middle| Y(o),X,Z \right)},\quad o \in \{ 0,1\},} & (A.16) \end{matrix}$

where

$\frac{p_{\beta}\left( O \middle| Y(o),X,Z \right)}{p_{\beta}\left( O = o \middle| Y(o),X,Z \right)}$

captures the unknown mechanism in the missing data distribution.

To see how the counterfactual outcome is reflected in the above formulation, when O=õ:=1−o and o=1:

${{p_{\beta}\left( {{Y(1)},{O = \left. 0 \middle| X \right.},Z} \right)} = {{p\left( {{{{Y(1)}❘O} = 1},X,Z} \right)}{{p\left( {{O = {1❘X}},Z} \right)} \cdot \frac{p_{\beta}\left( {{O = \left. 0 \middle| {Y(1)} \right.},X,Z} \right)}{p_{\beta}\left( {{O = \left. 1 \middle| {Y(1)} \right.},X,Z} \right)}}}},$

which gives the joint distribution of the outcome if the item was not exposed and the observed data where the item is exposed. Notice that both p(Y(o)|O=o, X, Z) and p(O=o|X, Z) can be estimated from the data, since Y(o) is observed under O=o. So the unknown mechanism in the missing data distribution is:

p _(β)(O|Y(o), X, Z)/p _(β)(O=o|Y(o), X, Z).

Hence, the counterfactual outcome distribution can be given by:

p _(β)(Y(o)|O=1−o, X, Z)∝p _(obs)(Y(o)|O=o, X, Z)/G _(β)(Y(o), X, Z), o ∈ {0, 1},  (A.17)

where p_(obs) denotes the observable distribution and

${G_{\beta}\left( {{Y(o)},X,Z} \right)} = \frac{p_{\beta}\left( {{O = \left. o \middle| {Y(o)} \right.},X,Z} \right)}{p_{\beta}\left( {\left. O \middle| {Y(o)} \right.,X,Z} \right)}$

characterizes the exposure mechanism even when unobserved factors exist.

The unknown G_(β)(Y(o), X, Z) can be treated as a learnable objective in this setting. As discussed herein, g_(ψ) can be used to characterize the role of X and Z in the exposure mechanism G_(β), hence the formulation of

$\frac{\delta\left( {Y,{f_{\theta}\left( {X,Z} \right)}} \right)}{G_{\beta}\left( {Y,{g_{\psi}\left( {X,Z} \right)}} \right)}$

in (9).

Including Y in modeling the exposure mechanism can cause the so-called self-selection problem in causal inference. This setting does not fall into that category, since the objective is to learn the f_(θ), rather than making inference on its treatment effect.

It is shown in the ablation studies that if the user feedback Y is not included, i.e., G_(β)(Y, g_(ψ)(X, Z)):=σ(g_(ψ)(X, Z)), the improvements over the original models will be less significant.

Minimax Optimization and Robust Evaluation

In a number of embodiments, to handle the adversarial training, a sequential optimization setup can be adopted in which the players take turns updating their models. Without loss of generality, the objective in (8) can be treated as a function of the two models: min_(f_θ) max_(g_φ) 𝒮(f_(θ), g_(φ)). When 𝒮 is nonconvex-nonconcave, the classical Minimax Theorem no longer holds, and min_(f_θ) max_(g_φ) 𝒮(f_(θ), g_(φ)) ≠ max_(g_φ) min_(f_θ) 𝒮(f_(θ), g_(φ)). Consequently, which player goes first can have implications. Here, f_(θ) can be trained first because g_(φ) can then choose the worst candidate from the uncertainty set in order to undermine f_(θ). The two-timescale gradient descent ascent (GDA) schema, which can be applied to train adversarial objectives, can be used, as provided in Algorithm 1 below. However, the existing analysis of GDA converging to a local Nash equilibrium can assume simultaneous training, so its guarantees do not apply here. Instead, training can continue until the objective stops changing when either f_(θ) or g_(φ) is updated.

Algorithm 1: Minimax optimization
Input: Learning rates r_(θ), r_(φ); discounts d_(θ), d_(φ) > 1;
While loss not stabilized do
|  θ = θ − r_(θ) ∇_(θ) 𝒮(f_(θ), g_(φ));
|  φ = φ + r_(φ) ∇_(φ) 𝒮(f_(θ), g_(φ));
|  r_(θ) = r_(θ)/d_(θ), r_(φ) = r_(φ)/d_(φ);
end

Consequently, the stationary points in Algorithm 1 may not attain a local Nash equilibrium. Nevertheless, when the timescales of the two models differ significantly (by adjusting the initial learning rates and discounts), it has been shown that the stationary points belong to the local minimax solution up to some degenerate cases. The local minimaxity captures the optimal strategies in the sequential game if both models are allowed to change their strategies only locally. Hence, Algorithm 1 leads to solutions that are locally optimal. Finally, the role of G_(β) is less relevant in the sequential game, and no significant differences were observed between updating it before or after f_(θ) and g_(φ).
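The update pattern of Algorithm 1 can be illustrated on a toy problem. The sketch below is not the training code of the described approach: it assumes scalar parameters and a hypothetical convex-concave objective 0.5θ² + θφ − 0.5φ² whose saddle point is at the origin, with hand-written gradients, and the function name `two_timescale_gda`, the tolerance, and the step cap are all illustrative choices. The descent player is updated first, the ascent player second, and both learning rates are then discounted.

```python
def objective(theta, phi):
    # Hypothetical toy objective with a saddle point at (0, 0).
    return 0.5 * theta ** 2 + theta * phi - 0.5 * phi ** 2


def two_timescale_gda(theta, phi, r_theta, r_phi, d_theta, d_phi,
                      tol=1e-12, max_steps=10_000):
    """Sequential gradient descent ascent with discounted learning rates."""
    previous = objective(theta, phi)
    for _ in range(max_steps):
        # Descent step for the minimizing player (d objective / d theta = theta + phi).
        theta = theta - r_theta * (theta + phi)
        # Ascent step for the maximizing player (d objective / d phi = theta - phi).
        phi = phi + r_phi * (theta - phi)
        # Discount both learning rates.
        r_theta, r_phi = r_theta / d_theta, r_phi / d_phi
        current = objective(theta, phi)
        if abs(current - previous) < tol:  # loss has stabilized
            break
        previous = current
    return theta, phi


theta_star, phi_star = two_timescale_gda(
    theta=1.0, phi=1.0, r_theta=0.1, r_phi=0.3, d_theta=1.01, d_phi=1.01
)
```

The larger initial rate for φ gives the two players different timescales; in a real model the hand-written gradients would be replaced by automatic differentiation.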

Recommenders are often evaluated by the mean square error (MSE) on explicit feedback, and by information retrieval metrics such as DCG and NDCG on implicit feedback. After the training, the candidate model f_(θ), as well as the G_(β)(g_(φ)) that gives the worst-case propensity score function specialized for f_(θ), can be obtained. Therefore, instead of pursuing unbiased evaluation, consider the robust evaluation that uses G_(β)(g_(φ)). It frees the offline evaluation from the potential impact of the exposure mechanism, and thus provides a robust view of the true performance. For instance, the robust NDCG can be computed via:

$\frac{1}{\mathcal{D}_{test}}{\sum\limits_{{({u,i})} \in \mathcal{D}_{test}}{{{NDCG}\left( {y_{u,i},{f_{\theta}\left( {x_{u},z_{i}} \right)}} \right)}/{{G_{\beta}\left( {g_{\varphi}\left( {x_{u},z_{i}} \right)} \right)}.}}}$

Experiment and Result

Simulation studies, real-data analysis, and online experiments were conducted to demonstrate the various benefits of the adversarial counterfactual learning and evaluation approach described herein. In the simulation study, the synthetic data was generated from real-world explicit feedback datasets so that there is access to the oracle exposure mechanism. It is then shown that models trained by the techniques described herein achieve superior unbiased offline evaluation performances. In the real-world data analysis, it is demonstrated that the models trained by the techniques described herein also achieve greater improvements even under the standard offline evaluation. Online experiments also were conducted, which verify that the robust evaluation described herein is more accurate than the standard offline evaluation when compared with the actual online evaluations.

The techniques described herein involve a high-level learning and evaluation approach that is compatible with most of the existing recommendation models, so well-known baseline models were used to demonstrate the effectiveness of the approach described herein. Specifically, the popularity-based recommendation (Pop), matrix factorization collaborative filtering (CF), multi-layer perceptron-based CF model (MLP), neural CF (NCF) and the generalized matrix factorization (GMF) are employed as the representatives for the content-based recommendation. The prevailing attention-based model (Attn) also is considered as a representative for the sequential recommendation. Also, f_(θ) and g_(φ) are chosen among the above baseline models for this adversarial counterfactual learning. To fully demonstrate the effectiveness of the adversarial training described herein, experiments were also run with the non-adversarially trained propensity-score method (PS), in which g_(φ) is first optimized on the regularization term alone until convergence and then kept fixed, and f_(θ) is then trained in the regular propensity-weighted ERM setting. For the sake of notation, the learning approach described herein is listed as ACL (adversarial counterfactual learning).

The various methods were examined with the widely-adopted next-item recommendation task. In particular, all but the last two user-item interactions are used for training, the second-to-last interaction is used for validation, and the last interaction is used for testing. The data descriptions, preprocessing steps, train-validation-test split, simulation settings, detailed model configuration as well as the implementation procedure are now described. The training process is also visualized to reveal the adversarial nature of the approach described herein. A complete set of ablation study and sensitivity analysis results is provided to demonstrate the robustness of this approach. The implementation and datasets have been made available at https://github.com/StatsDLMathsRecomSys/Adversarial-Counterfactual-Learning-and-Evaluation-for-Recommender-System.

Three real-world datasets are considered, which cover movie, book and music recommendations:

-   -   Movielens-1M. The benchmark dataset records users' ratings for movies, which includes around 1 million ratings collected from 6,040 users on 3,952 movies. The rating is from 1 to 5, and a higher rating indicates more positive feedback. This dataset is available at http://files.grouplens.org/datasets/movielens/m1-1m.zip.
-   -   LastFM. The LastFM dataset is a benchmark dataset for music recommendation. Each of the 1,892 listeners tags the artists they may be fond of over time. Since the tag is a binary indicator, LastFM is an implicit feedback dataset. There is a total of 186,479 tagging events, where 12,523 artists have been tagged. This dataset is available at http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip.
-   -   GoodReads. The benchmark book recommendation dataset is scraped from the users' public shelves on Goodreads.com. The user review data on the history and biography sections is used due to its richness. There are in total 238,450 users, 302,346 unique books, and 2,066,193 ratings in these sections. The rating range is also from 1 to 5, and a higher rating indicates more positive feedback. This dataset is available at https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home.

The Movielens-1M dataset has been filtered before being made available, so that each user in the dataset has rated at least 20 movies. For the LastFM and GoodReads datasets, infrequent items (books/artists) are first eliminated, as well as users that have fewer than 20 records. After examination, a small proportion of users are found to have an abnormal number of interactions. Therefore, the users who have more than 1,000 interactions are treated as spam users and not included in the analysis.
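The user-level filtering described above can be sketched as follows. This is a hedged illustration, not the actual preprocessing code: interactions are assumed to be (user, item) pairs, the function name `filter_users` is hypothetical, and the item-frequency filter is omitted because its threshold is not stated.

```python
from collections import Counter


def filter_users(interactions, min_records=20, max_records=1000):
    """Keep interactions of users with min_records <= count <= max_records.

    Users below the lower bound are treated as too sparse; users above the
    upper bound are treated as spam users.
    """
    counts = Counter(user for user, _ in interactions)
    keep = {u for u, c in counts.items() if min_records <= c <= max_records}
    return [(u, i) for u, i in interactions if u in keep]


# Toy data with small thresholds: "a" is too sparse, "c" is flagged as spam.
toy = [("a", 1)] + [("b", i) for i in range(3)] + [("c", i) for i in range(6)]
kept = filter_users(toy, min_records=2, max_records=5)
```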

The train-validation-test split is carried out based on the order of the user-item interactions. The standard setting is adopted, where for each user interaction sequence, all items but the last two are used in training, the second-to-last interaction is used in validation, and the last interaction is used in testing.
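The split described above can be written in a few lines. The sketch below assumes each user's interactions are already sorted in chronological order and that every user has at least three interactions; the function name `leave_last_two_out` is hypothetical.

```python
def leave_last_two_out(sequence):
    """Split one user's ordered interactions into train, validation, and test."""
    if len(sequence) < 3:
        raise ValueError("need at least three interactions per user")
    train = sequence[:-2]      # all but the last two interactions
    validation = sequence[-2]  # second-to-last interaction
    test = sequence[-1]        # last interaction
    return train, validation, test


train_items, val_item, test_item = leave_last_two_out([10, 11, 12, 13, 14])
```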

In a modern real-world recommender system, the exposure mechanism is determined by the underlying recommender model as well as various other factors. In an attempt to mimic the real-world recommender systems, a two-stage simulation approach is designed to generate the semi-synthetic data that remains truthful to the signal in the original dataset.

The purpose of the first stage is to learn the characteristics of the data, such as the user relevance (rating) model and the partial exposure model (which may be inaccurate due to the partial observation of exposure status). In the second stage, the working method of a real-world recommender system is simulated, and the user response is generated accordingly. In order to recover the user-item relevance as accurately as possible, the explicit feedback datasets are used for the simulation, i.e., the Movielens-1M and Goodreads datasets.

In the first stage, given a true rating matrix, two hidden-factor matrix factorization models are trained. The first model tries to recover the rating matrix by minimizing the mean-squared loss. This model is referred to as the relevance model. For the explicit feedback data, the rated items have all been exposed, so given the output 𝔼[R_(u,i)|O_(u,i)=1], the relevance probability is defined as

$p_{\text{sim1}}(Y_{u,i}=1 \mid O_{u,i}=1) := \sigma\left(\mathbb{E}[R_{u,i} \mid O_{u,i}=1] + \epsilon_1\right),$

where σ(·) is the sigmoid function, and the Gaussian noise ϵ₁ reflects the perturbations brought by unobserved factors. The second model is an implicit-feedback model trained to predict the occurrence of the rating event p̂(O_(u,i)=1), where instead of using the original ratings, the non-zero entries in the rating matrix are all converted to one.

After obtaining p̂(O_(u,i)=1), the simulation exposure probability is defined as log p_(sim1)(O_(u,i)=1)=log p̂(O_(u,i)=1)+ϵ₂, where ϵ₂ also gives the extra randomness due to the unobserved factors.

Now, after obtaining the simulated p_(sim1)(Y_(u,i)=1|O_(u,i)=1) and p_(sim1)(O_(u,i)=1), which reflect both the relevance and the exposure underlying the real data-generating mechanism while taking into account the effects of unobserved factors, the first-stage click data is generated by:

$p_{\text{sim1}}(Y_{u,i}=1) = p_{\text{sim1}}(Y_{u,i}=1 \mid O_{u,i}=1)\, p_{\text{sim1}}(O_{u,i}=1).$

So far, in the first stage, an implicit feedback dataset has been generated that remains truthful to the original real dataset. Now the self-defined components can be added, which gives more control over the exposure mechanism. Specifically, the new user and item hidden factors x, z are obtained by training another implicit matrix factorization model using the generated click data. The extra self-defined exposure function e(x, z) is generated and added to the first-stage p_(sim1), to obtain the second-stage exposure mechanism:

$\log p_{\text{sim2}}(O_{u,i}=1) = \log p_{\text{sim1}}(O_{u,i}=1) + e(x, z).$

The final click data is then generated via:

$p_{\text{sim2}}(Y_{u,i}=1) = p_{\text{sim1}}(Y_{u,i}=1 \mid O_{u,i}=1)\, p_{\text{sim2}}(O_{u,i}=1).$

Having the second stage in the simulation is beneficial because the first stage focuses on mimicking the generating mechanism of the real-world dataset, whereas the second stage allows control of the exposure mechanism via the extra e(x, z). Also, retraining the implicit matrix factorization model at the beginning of the second stage is not required, though it can help to better characterize the data generated in the first stage.
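The two-stage simulation above can be sketched, for a single user-item pair, as follows. This is an illustrative sketch only: the function names, the noise scale, the Gaussian form of ϵ₂, and the clipping of exposure probabilities at 1 are assumptions that the description does not specify.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def first_stage(expected_rating, p_hat_exposure, rng, noise_scale=0.1):
    """Return (p_sim1(Y=1|O=1), p_sim1(O=1)) for one user-item pair."""
    eps1 = rng.gauss(0.0, noise_scale)  # Gaussian noise on the relevance score
    eps2 = rng.gauss(0.0, noise_scale)  # assumed Gaussian; not stated in the text
    p_relevance = sigmoid(expected_rating + eps1)
    # log p_sim1(O=1) = log p_hat(O=1) + eps2, clipped to remain a probability
    p_exposure = min(1.0, math.exp(math.log(p_hat_exposure) + eps2))
    return p_relevance, p_exposure


def second_stage(p_relevance, p_exposure_1, e_xz):
    """Add the self-defined exposure term e(x, z) on the log scale.

    Returns (p_sim2(Y=1), p_sim2(O=1)).
    """
    p_exposure_2 = min(1.0, p_exposure_1 * math.exp(e_xz))
    return p_relevance * p_exposure_2, p_exposure_2


def sample_click(p_click, rng):
    """Draw a binary click from the simulated click probability."""
    return 1 if rng.random() < p_click else 0
```

Passing a seeded `random.Random` instance as `rng` keeps the simulated click data reproducible across repetitions.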

For the baseline models considered, other than Pop, the dimension of the user and item hidden factors, the initial learning rate, and the ℓ₂ regularization strength are the basic hyperparameters. The initial learning rate is selected from {0.001, 0.005, 0.01, 0.05, 0.1}, and the ℓ₂ regularization strength from {0, 0.01, 0.05, 0.1, 0.2, 0.3}. The tuning parameters are selected separately to avoid excessive computations. The hidden dimension is fixed at 32 for the models in order to achieve fair comparisons in the experiments. Also, notice that this approach has approximately twice the number of parameters of the corresponding baseline model. In practice, the hidden dimension can be treated as a hyperparameter as well. Sensitivity analysis on the hidden dimension is provided later in this section. The Hit@10 on validation data can be used as the metric for selecting hyperparameters.

To check that the superior performance of this approach is not a consequence of higher model complexity, the hidden factor dimension of the baseline models is doubled to 64 when suitable.
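For illustration, selecting the learning rate and the regularization strength separately, rather than over the full grid, can be sketched as below. The callable `validation_score` is a hypothetical stand-in for training a model and returning its validation metric (such as Hit@10), and tuning the learning rate before the regularization strength is an assumed order.

```python
LEARNING_RATES = [0.001, 0.005, 0.01, 0.05, 0.1]
L2_STRENGTHS = [0, 0.01, 0.05, 0.1, 0.2, 0.3]


def tune_separately(validation_score, default_l2=0):
    """Pick each hyperparameter in turn instead of searching the full grid.

    validation_score(lr, l2) must return a higher-is-better metric.
    """
    # First sweep the learning rate with the regularization held fixed.
    best_lr = max(LEARNING_RATES, key=lambda lr: validation_score(lr, default_l2))
    # Then sweep the regularization strength at the chosen learning rate.
    best_l2 = max(L2_STRENGTHS, key=lambda l2: validation_score(best_lr, l2))
    return best_lr, best_l2
```

This uses 5 + 6 = 11 training runs instead of the 5 × 6 = 30 runs a joint grid search would need.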

Among the baseline models, Pop, CF, GMF, and Neural CF are all conventional approaches in recommender systems that have relatively simple structures, so the default settings are adopted without describing their details. More detail is provided for the attention-based sequential recommendation model Attn and the propensity-score method PS. For Attn, the model setting adopted has the self-attention mechanism added on top of an item embedding layer. The hidden dimension of the key, query, and value matrices, and the number of dot-product attention heads are treated as the additional tuning parameters. For the PS method, there are two stages:

1. Obtain g_(ψ)* by minimizing 𝔼_(P_n)[δ(Y, g_(ψ)(X,Z))] as a standard ERM;
2. Implement:

$\underset{f_{\theta} \in \mathcal{F},\, \beta}{\text{minimize}}\ \mathbb{E}_{P_{n}}\left[ \frac{\delta\left(Y, f_{\theta}(X,Z)\right)}{G_{\beta}\left(g_{\psi}^{*}(X,Z)\right)} \right],$

as a propensity-weighted ERM. The tuning parameters for g_(ψ) and f_(θ) are selected in each stage separately.
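As a non-limiting sketch of the second stage, the propensity-weighted empirical risk can be computed from per-sample losses δ(Y, f_(θ)(X,Z)) and per-sample propensities G_(β)(g_(ψ)*(X,Z)) that are assumed to have been computed already. The function name and the lower clip on the propensities are hypothetical additions, the latter included only to avoid division by values near zero.

```python
def propensity_weighted_risk(losses, propensities, clip=1e-3):
    """Average of loss / propensity over the sample.

    losses[i] plays the role of delta(Y_i, f_theta(X_i, Z_i)) and
    propensities[i] the role of G_beta(g_psi*(X_i, Z_i)).
    clip is a hypothetical safeguard that is not part of the description.
    """
    if len(losses) != len(propensities):
        raise ValueError("losses and propensities must have the same length")
    weighted = [loss / max(p, clip) for loss, p in zip(losses, propensities)]
    return sum(weighted) / len(weighted)
```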

The configurations for the approach include two parts: the usual model configuration for f_(θ) and g_(ψ), and the two-timescale training schema. Firstly, the tuning parameters selected for f_(θ) and g_(ψ) when each is trained alone also give near-optimal performance in the adversarial counterfactual training setting. Therefore, the hyperparameters (other than the learning rate) selected in their individual training for f_(θ) and g_(ψ) are directly adopted. Experiments are run on several settings for the two-timescale update to understand the impact of the relative magnitude of the initial learning rates r_(θ) and r_(ψ). In practice, the learning rate discount is less relevant when using the Adam optimizer, since the learning rate is automatically adjusted. Intuitively speaking, the smaller r_(ψ) is (relative to r_(θ)), the less g_(ψ) is subject to the regularization in the beginning stage, and the less restricted its adversarial behavior is. As a consequence, f_(θ) may not learn anything useful. Empirical evidence to support the above point is provided in FIG. 5, which illustrates plots 510, 520, 530, and 540. These plots show results from an adversarial training process on the Goodread synthetic data using ACL (GMF/GMF) in plots 510 and 530, and ACL (MLP/MLP) in plots 520 and 540. Plots 510 and 520 show the training objectives for f_(θ) and g_(ψ), i.e., 𝔼_(P_n)[δ(Y, f_(θ)(X, Z))/G_(β)(Y, g_(ψ)(X, Z))] and 𝔼_(P_n)[δ(Y, g_(ψ)(X, Z))], respectively. Plots 530 and 540 show the evaluation metric on the validation dataset. Finally, the regularization parameter for the approach described herein is selected from {0.1, 1, 2}. The hyperparameters that are specific to the adversarial counterfactual training described herein are the initial learning rates r_(θ) and r_(ψ), as well as the regularization parameter α.

The models, including the matrix factorization models, are implemented with PyTorch on an Nvidia V100 GPU machine. The sparse Adam optimizer, available at https://agi.io/2019/02/28/optimization-using-adam-on-sparse-tensors/, is used to update the hidden factors, and the usual Adam optimizer is used to update the remaining parameters. Sparse Adam is used for the hidden factors because both the user and item factors are relatively sparse in recommendation datasets. The Adam algorithm leverages the momentum of the gradients from the previous training batch, which may not be accurate for the item and user factors in the current training batch. The sparse Adam optimizer is designed to solve the above issue for sparse tensors.

The early-stopping training method is used for both the baseline models and this approach. For the baseline models, the training process is terminated when the validation metric stops improving for 10 consecutive epochs. For this approach, the minimax objective value is monitored, and the training process is terminated if it does not change by more than ϵ=0.001 for ten consecutive epochs.
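The two stopping rules can be sketched as follows. The function names are illustrative. The sketch assumes a higher-is-better validation metric, and it reads the second rule as requiring the epoch-to-epoch change in the minimax objective to stay within ϵ for ten consecutive epochs, which is one interpretation of the description.

```python
def should_stop_baseline(val_metrics, patience=10):
    """True if the validation metric has not improved in the last `patience` epochs."""
    if len(val_metrics) <= patience:
        return False
    best_before = max(val_metrics[:-patience])
    return max(val_metrics[-patience:]) <= best_before


def should_stop_minimax(objectives, tol=1e-3, patience=10):
    """True if the minimax objective changed by at most `tol` in each of the
    last `patience` epoch-to-epoch steps."""
    if len(objectives) <= patience:
        return False
    recent = objectives[-(patience + 1):]
    return all(abs(b - a) <= tol for a, b in zip(recent, recent[1:]))
```

Each function is called once per epoch with the full history of recorded values.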

In a single update step, the space and time complexity of this adversarial counterfactual training is the sum of those of f_(θ) and g_(ψ) (where the complexity induced by G_(β) is almost negligible). In general, this approach may take more training epochs to converge, depending on the ratio r_(θ)/r_(ψ) in the two-timescale training schema.
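To illustrate how a two-timescale update with separate learning rates r_(θ) and r_(ψ) behaves, the sketch below runs gradient descent on θ and gradient ascent on ψ for a toy scalar objective. The toy objective, the simultaneous update order, and the function name are assumptions made purely for illustration; the sketch is not the minimax objective or the algorithm of the embodiments.

```python
def two_timescale_gda(r_theta, r_psi, steps=500, theta=1.0, psi=1.0):
    """Gradient descent ascent on the toy objective
    L(theta, psi) = 0.5 * theta**2 + theta * psi - 0.5 * psi**2,
    minimized over theta and maximized over psi (saddle point at the origin).
    """
    for _ in range(steps):
        grad_theta = theta + psi   # dL/dtheta
        grad_psi = theta - psi     # dL/dpsi
        # Simultaneous step: descent for theta, ascent for psi, each with its own rate.
        theta, psi = theta - r_theta * grad_theta, psi + r_psi * grad_psi
    return theta, psi
```

Calling `two_timescale_gda(0.05, 0.2)` uses a larger rate for the maximizing player, loosely mirroring a setting in which g_(ψ) is updated faster than f_(θ).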

To demonstrate the underlying adversarial training process of the adversarial counterfactual training method described herein, the training progress is plotted under several settings in FIG. 5 and FIG. 6. FIG. 6 illustrates plots 610, 620, 630 and 640, which show results of an adversarial training process on the real Goodread data using ACL (Attn/Attn) that results in the same pattern for the sequential recommendation setting, and demonstrates the effectiveness of including the outcomes into the G_(β) for modeling the exposure mechanism. The “use outcome” indicates whether Y is used for modeling G_(β).

From FIG. 5, the following can be observed:

-   With a larger initial learning rate, g_(ψ) tends to fit the data quicker than f_(θ).
-   In the beginning stage, when g_(ψ) has not yet fitted the data well, its adversarial behavior on f_(θ) is too strong, since both the loss value and the evaluation metric for f_(θ) are poor during that period. This also suggests the benefits of using a larger initial learning rate for g_(ψ).
-   As the training progresses, f_(θ) eventually catches up with and outperforms g_(ψ) in terms of the evaluation metric. However, the loss objective for f_(θ) is still larger, which is reasonable since it has the extra adversarial term in 𝔼_(P_n)[δ(Y, f_(θ)(X, Z))/G_(β)(Y, g_(ψ)(X, Z))], which is controlled by g_(ψ). This also implies that g_(ψ) is acting adversarially throughout the whole process, which matches the design of the adversarial game.
-   The training process gradually achieves the local minimax optimum, where both f_(θ) and g_(ψ) are unable to undermine the performance of each other, and their individual performances improve at the same pace in the latter training phase.

The adversarial training on the real-world dataset using the sequential recommendation model ACL (Attn/Attn) in FIG. 6 is then examined. In plots 610 and 630 of FIG. 6, the same pattern as that of FIG. 5 can be observed, which suggests that the above discussions also apply to the real-world data and the sequential recommendation setting.

Further, a set of experiments was conducted in which the outcome is not included in modeling the exposure mechanism G_(β), as shown in plots 620 and 640. First of all, it is observed that the same adversarial training patterns still hold whether or not the outcome is included in modeling G_(β). Secondly, the performances, both in terms of the loss value and evaluation metric, are less ideal when Y is not included in G_(β).

A complete ablation study was performed. Firstly, the standard evaluations on the real-world data using the propensity score model are shown in table 1210 of FIG. 12 for the three real-world datasets. Table 1000 of FIG. 10 shows standard evaluations (without accounting for exposure) for the baselines and the approach described herein on the benchmark data. Similarly, the config rows provide the f_(θ) and g_(ψ) model choices when trained with the PS and the ACL approaches. Also shown are the best f_(θ) and g_(ψ) combination for the PS method, and the full results for the ACL approach. Compared with the results in table 1000 of FIG. 10, the results in table 1210 of FIG. 12 show that the adversarial counterfactual training approach still outperforms its propensity-score counterparts, which again emphasizes the benefits of having the adversarial process between f_(θ) and g_(ψ). Secondly, table 1220 of FIG. 12 shows the standard evaluations for the baseline models trained with the adversarial counterfactual approach (ACL base model) on the real-world dataset. Models trained with the ACL approach uniformly outperform their counterparts. Notice that the superior performances of the ACL approach do not result from a larger model complexity, since the hidden factor dimension of the corresponding baseline models was doubled, such that the number of parameters is approximately the same for all models.
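The parameter parity noted above can be checked with simple arithmetic. The sketch below counts only the user and item hidden-factor parameters of a matrix-factorization-style model and ignores any additional layers, which is why the match is only approximate in practice. The user and item counts in the usage note are hypothetical.

```python
def hidden_factor_parameter_count(n_users, n_items, hidden_dim):
    """Number of user and item hidden-factor parameters in one model."""
    return (n_users + n_items) * hidden_dim


def acl_parameter_count(n_users, n_items, hidden_dim):
    """The adversarial approach trains two such models, one per player."""
    return 2 * hidden_factor_parameter_count(n_users, n_items, hidden_dim)
```

For example, with a hypothetical 1,000 users and 500 items, the two-model approach at dimension 32 and a single baseline at dimension 64 both have (1,000 + 500) × 64 = 96,000 hidden-factor parameters.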

Sensitivity analysis is provided for the adversarial counterfactual approach, focusing mostly on the user/item hidden factor dimension size and the regularization parameter α. The results are reported on the real-world datasets. FIG. 7 illustrates graphs 710, 720, 730, 740, 750, and 760, showing results of sensitivity analysis of hidden factor dimension for the content-based ACL (GMF/GMF) model and the sequential ACL (Attn/Attn) model together with their corresponding baseline models, on the three real-world datasets. The hidden dimensions for the corresponding baselines are doubled from what is shown in the plots to achieve fair comparisons. From top to bottom are results for the Movielens-1M data in graphs 710 and 720, LastFM data in graphs 730 and 740, and Goodread.com data in graphs 750 and 760. The sensitivity analysis on user/item hidden factor dimension size is shown in FIG. 7, showing that the larger dimensions most often lead to better outcomes (within the range considered), which is in accordance with the common consensus in the recommender system domain. This also suggests that the ACL approach inherits some of the properties from f_(θ) and g_(ψ), so model understanding and diagnostics also become easier if f_(θ) and g_(ψ) are well-studied.

The sensitivity analysis on the regularization parameter is provided in FIG. 8. FIG. 8 illustrates graphs 810, 820, 830, 840, 850, and 860, showing results of sensitivity analysis on the regularization parameter α for the content-based ACL (GMF/GMF) model and the sequential ACL (Attn/Attn) model for their f_(θ) and g_(ψ) components, on the three real-world datasets. From top to bottom are results for the Movielens-1M data in graphs 810 and 820, LastFM data in graphs 830 and 840, and Goodread.com data in graphs 850 and 860. The experiment was not performed on a wide range of α; however, the results at hand already show the patterns, which indicate that the ACL approach achieves the best performances when α is neither too big nor too small. In this context, when α is too small, the regularization on g_(ψ) becomes relatively weak compared with the loss objective of f_(θ), so g_(ψ) does not fit the data well. As a consequence, f_(θ) also suffers from the under-fitting issues of g_(ψ). On the other hand, when α gets too large, the minimax game will focus more on fitting g_(ψ) to the data and overlook f_(θ).

Additionally, the online experiments provide valuable evaluation results that reveal the appeal of the ACL approach for real-world applications. All the online experiments were conducted for a content-based item page recommendation module, under the implicit feedback setting where the users click or do not click the recommendations. A list of ten items is shown to the customer on each item page, e.g., items that are similar or complementary to the anchor item on that page. The recommendation is personalized, so the user identification (ID) and user features are included in the model as well.

In each iteration of model deployment, the new item features and user features are added into the previous model. The architecture of the recommendation model generally remains unchanged during the iterations, which makes it favorable for examining the ACL approach. Four online experiments (A/B testing) have been conducted for a total of eight models that are trained offline using the adversarial counterfactual training described herein, and then evaluated using the historical implicit feedback data. Unobserved factors such as the real-time user features, page layout, and same-page advertisements are continually changing and are thus not included in the analysis. The metric that is used to compare the different offline evaluation methods with online evaluation is the click-through rate.

For synthetic data analysis, the explicit feedback data from MovieLens-1M and Goodreads datasets were used. A baseline CF model was trained, and the optimized hidden factors were used to generate a synthetic exposure mechanism, which was treated as the oracle exposure. The implicit feedback data was then generated according to the oracle exposure as well as the optimized hidden factors. Unbiased offline evaluation was possible because of access to the exposure mechanism. Also, to set a reasonable benchmark under the simulation setting, additional experiments were performed in which g_(φ) is given by the oracle exposure model. The results are provided in FIG. 9, which provides unbiased evaluations (using the true exposure) for the baselines and the ACL approach on the semi-synthetic data. Table 910 of FIG. 9 shows in the config rows the g_(φ) model (such as using the baseline models and the oracle model) when trained with the propensity-score (PS) approach or the approach described herein (marked by the ACL-). Table 920 shows the original baseline models without using the propensity-score approach or ACL. Bold font and underscore are used to mark the best and second-best outcomes, respectively. The mean and standard deviation are computed over ten repetitions. When trained with the approach described herein, the baseline models yield their best performances (other than the oracle-enhanced counterparts) under the unbiased offline evaluation, and outperform the rest of the baselines, which reveals the first appeal of the ACL approach.

For real data analysis, other than using the MovieLens-1M and Goodreads data in the implicit feedback setting, the LastFM music recommendation (implicit feedback) dataset is further included. The results in table 1000 of FIG. 10 show that the models trained by the ACL approach achieve the best outcome, even using the standard evaluation where the exposure mechanism is not considered. The better performance in standard evaluation suggests the second appeal of the adversarial counterfactual learning, that even though it optimizes towards the minimax setting, the robustness is not at the cost of the performance under the standard evaluation.

For online experiment analysis, to examine the practical benefits of the robust learning and evaluation approach described herein in real-world experiments, several online A/B testing scenarios were carried out on Walmart.com, a major e-commerce platform in the U.S., in a content-based item recommendation setting, with access to the actual online testing and evaluation results. All the candidate models were trained offline using the approach described herein. The standard offline evaluation, popularity-debiased offline evaluation (where the item popularity is used as the propensity score), the propensity-score model approach, and the robust evaluation described herein were compared with respect to the actual online evaluations. Table 1100 of FIG. 11 shows results of the mean-squared error (MSE) to online evaluation results from eight online experiments. Table 1100 of FIG. 11 shows that the evaluation approach described herein is indeed a more robust approximation to online evaluation. It reveals a third appeal of the approaches described herein, that they are capable of narrowing the gap between online and offline evaluations.
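The comparison metric can be sketched as a mean-squared error between an offline evaluation method's estimates and the corresponding online results, one pair per experiment. The function name and the pairing of one offline estimate with one online result are assumptions about how the comparison is organized.

```python
def mse_to_online(offline_estimates, online_results):
    """Mean-squared error between offline estimates and online results.

    Both arguments are equal-length sequences with one entry per online
    experiment (e.g., an estimated and an observed click-through rate).
    """
    if len(offline_estimates) != len(online_results):
        raise ValueError("inputs must have the same length")
    if not online_results:
        raise ValueError("at least one experiment is required")
    squared = [(o - t) ** 2 for o, t in zip(offline_estimates, online_results)]
    return sum(squared) / len(squared)
```

A lower value indicates that the offline evaluation method tracks the online evaluation more closely.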

In many embodiments, the techniques described herein can improve on the drawback of supervised learning for recommender systems, by using a theoretically-grounded adversarial counterfactual learning and evaluation framework. The theoretical and empirical results illustrate the benefits of the techniques described herein.

Exemplary System and Method

Turning ahead in the drawings, FIG. 13 illustrates a block diagram of a system 1300 that can be employed for generating recommendations using adversarial counterfactual learning and evaluation, according to an embodiment. System 1300 is merely exemplary and embodiments of the system are not limited to the embodiments presented herein. The system can be employed in many different embodiments or examples not specifically depicted or described herein. In some embodiments, certain elements, modules, or systems of system 1300 can perform various procedures, processes, and/or activities. In other embodiments, the procedures, processes, and/or activities can be performed by other suitable elements, modules, or systems of system 1300. System 1300 can be similar to system 300 (FIG. 3), and various components of system 1300 can be similar or identical to system 300 (FIG. 3).

Generally, therefore, system 1300 can be implemented with hardware and/or software, as described herein. In some embodiments, part or all of the hardware and/or software can be conventional, while in these or other embodiments, part or all of the hardware and/or software can be customized (e.g., optimized) for implementing part or all of the functionality of system 1300 described herein. System 1300 can be a computer system, such as computer system 100 (FIG. 1), as described above, and can be a single computer, a single server, or a cluster or collection of computers or servers, or a cloud of computers or servers. In another embodiment, a single computer system can host system 1300.

In some embodiments, system 1300 can include offline components 1310, such as a data component 1311, a training component 1312, and/or an evaluation component 1313, which can be in data communication with a database 1316 that includes logging data, and can use an algorithm 1315 and/or candidates 1317, such as f_(θ) and g_(φ). System 1300 also can include an online ranking and serving component 1314, which can receive requests from a front-end component 1320 for recommendations, and can return recommendations to front-end component 1320. Data component 1311 and/or training component 1312 can be similar or identical to training system 312 (FIG. 3). Evaluation component 1313 can be similar or identical to evaluation component 313 (FIG. 3). Online ranking and serving component 1314 can be similar or identical to real-time serving system 314 (FIG. 3). Database 1316 can be similar or identical to database 315 (FIG. 3). Front-end component 1320 can be similar or identical to web server 320 (FIG. 3).

In many embodiments, data component 1311 can receive raw logging data, such as historical user session data, from database 1316 to prepare training and evaluation data, such as personalized recommendation data and/or item recommendation data. In some embodiments, personalized recommendation data can include, for each data record, a user feature, a view sequence, an item feature, a target purchase, a label, and/or other suitable information. In various embodiments, item recommendation data can include, for each data record, an anchor item, a candidate item, an item feature, a label, and/or other suitable information. In many embodiments, the data can be used by training component 1312 and/or evaluation component 1313. In many embodiments, the label can be positive or negative, which can indicate whether the customer clicked on the item or not. The personalized recommendation data can be personalized to each user, and the item recommendation data can be generalized and not personalized to each user. The training data can provide the type of recommendation to be provided, such as recommendations for similar items or complementary items.

In a number of embodiments, training component 1312 can receive candidates 1317, which can be any candidate recommendation algorithm f_(θ) and an adversarial exposure model g_(φ). In many embodiments, multiple candidate recommendation algorithms f_(θ) can be received, such as for multiple different families of machine learning, such as linear regression, neural network, matrix factorization, etc. In several embodiments, training component 1312 can train candidate recommendation algorithms f_(θ) and an adversarial exposure model g_(φ) using data obtained from data component 1311 and algorithm 1315 to generate optimized candidate recommendation algorithms f̂_(θ) and the “most adversarial” exposure model ĝ_(φ). In many embodiments, algorithm 1315 can be a gradient ascent descent, as described above in Algorithm 1, used to optimize the minimax objective, such as the objective function described above in Equation 9.

In a number of embodiments, evaluation component 1313 can perform a robust offline evaluation using the most adversarial exposure model ĝ_(φ) to evaluate the optimized candidate recommendation algorithms f̂_(θ) on the evaluation data obtained from data component 1311, and select the best of the optimized candidate recommendation algorithms f̂_(θ), which can be denoted the optimal f̂_(θ).

In several embodiments, the optimal f̂_(θ) can be fed to online ranking and serving component 1314. In several embodiments, a recall set can be constructed, based on logging data in database 1316, to find a subset of candidate item pairs that are more likely to contain the optimal choice, as the full set of candidate item pairs can be too large for applying the model to all candidate item pairs. In several embodiments, the optimal f̂_(θ) can be used to rank the candidate items in the recall set, and the top-K recommendations can be fed to front-end component 1320, such as upon request from front-end component 1320.

Turning ahead in the drawings, FIG. 14 illustrates a flow chart for a method 1400 of generating recommendations using adversarial counterfactual learning and evaluation, according to an embodiment. Method 1400 is merely exemplary and is not limited to the embodiments presented herein. Method 1400 can be employed in many different embodiments or examples not specifically depicted or described herein. In some embodiments, the procedures, the processes, and/or the activities of method 1400 can be performed in the order presented. In other embodiments, the procedures, the processes, and/or the activities of method 1400 can be performed in any suitable order. In still other embodiments, one or more of the procedures, the processes, and/or the activities of method 1400 can be combined or skipped.

In many embodiments, system 300 (FIG. 3), recommendation system 310 (FIG. 3), web server 320 (FIG. 3), and/or system 1300 (FIG. 13) can be suitable to perform method 1400 and/or one or more of the activities of method 1400. In these or other embodiments, one or more of the activities of method 1400 can be implemented as one or more computing instructions configured to run at one or more processors and configured to be stored at one or more non-transitory computer readable media. Such non-transitory computer readable media can be part of system 300 (FIG. 3) and/or system 1300 (FIG. 13). The processor(s) can be similar or identical to the processor(s) described above with respect to computer system 100 (FIG. 1).

In some embodiments, method 1400 and other activities in method 1400 can include using a distributed network including distributed memory architecture to perform the associated activity. This distributed architecture can reduce the impact on the network and system resources to reduce congestion in bottlenecks while still allowing data to be accessible from a central location.

Referring to FIG. 14, method 1400 can include an activity 1405 of obtaining training data. In many embodiments, the training data can be generated from logging data, such as historical user session data. In a number of embodiments, the training data can include personalized recommendation data and/or item recommendation data. In some embodiments, the personalized recommendation data can include first records each including a respective user feature, a respective view sequence, a respective item feature, a respective target purchase, and a respective label. In various embodiments, the item recommendation data can include second records each including a respective anchor item, a respective candidate item, a respective item feature, and a respective label. In a number of embodiments, activity 1405 can be performed at least in part by communication system 311 (FIG. 3), training system 312 (FIG. 3), and/or data component 1311 (FIG. 13).

In a number of embodiments, method 1400 also can include an activity 1410 of training candidate recommendation models and an adversarial exposure model using the training data. The candidate recommendation models can be similar or identical to candidate recommendation algorithms f_(θ) described above. The adversarial exposure model can be similar or identical to adversarial exposure model g_(φ) described above. In a number of embodiments, the candidate recommendation models can include a linear regression recommendation model, a neural network recommendation model, and/or a matrix factorization recommendation model, or other suitable recommendation models. In several embodiments, activity 1410 can include performing a gradient ascent descent to optimize a minimax objective for each of the candidate recommendation models and the adversarial exposure model. The gradient ascent descent can be similar or identical to Algorithm 1 described above. The minimax objective can be similar or identical to the objective functions described above, such as Equation 9. In many embodiments, activity 1410 can train the candidate recommendation algorithms f_(θ) to generate optimized candidate recommendation algorithms f̂_(θ), and/or can train the adversarial exposure model g_(φ) to generate the “most adversarial” exposure model ĝ_(φ). In a number of embodiments, activity 1410 can be performed at least in part by training system 312 (FIG. 3) and/or training component 1312 (FIG. 13).

In several embodiments, method 1400 additionally and optionally can include an activity 1415 of performing an evaluation of the candidate recommendation models, as trained, using the adversarial exposure model, as trained. For example, the “most adversarial” exposure model ĝ_(φ) can be used to evaluate the optimized candidate recommendation algorithms f̂_(θ). In a number of embodiments, activity 1415 can be performed at least in part by evaluation system 313 (FIG. 3) and/or evaluation component 1313 (FIG. 13).

In a number of embodiments, method 1400 further and optionally can include an activity 1420 of selecting the selected recommendation model from among the candidate recommendation models based on the evaluation. For example, the evaluation can be used to determine the optimal f̂_(θ), which can be the best performing model of the optimized candidate recommendation algorithms f̂_(θ). In a number of embodiments, activity 1420 can be performed at least in part by evaluation system 313 (FIG. 3) and/or evaluation component 1313 (FIG. 13).

In several embodiments, method 1400 additionally can include an activity 1425 of generating recommendations based on a selected recommendation model of the candidate recommendation models. In a number of embodiments, activity 1425 can be performed at least in part by real-time serving system 314 (FIG. 3) and/or evaluation component 1313 (FIG. 13).

In a number of embodiments, activity 1425 can include an activity 1430 of constructing a recall set of candidate recommendation pairs.

In several embodiments, activity 1425 also can include an activity 1435 of generating a ranking of the candidate recommendation pairs in the recall set using the selected recommendation model. For example, the optimal f̂_(θ) can be used to rank the candidate items in the recall set.

In a number of embodiments, activity 1425 additionally can include an activity 1440 of determining the recommendations from the ranking. For example, the top-K ranked candidates can be used as the recommendations.
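Activities 1430, 1435, and 1440 can be sketched as a recall, rank, and top-K pipeline. In the sketch below, the recall step naively pairs the anchor item with every other catalog item, and toy_score_fn is a hypothetical stand-in for the selected recommendation model; the item names and scores are invented for illustration.

```python
def build_recall_set(anchor_item, catalog):
    """Pair the anchor with every other catalog item (a deliberately naive recall step)."""
    return [(anchor_item, item) for item in catalog if item != anchor_item]


def top_k_recommendations(anchor_item, catalog, score_fn, k=3):
    """Rank the recall set by score_fn (higher is better) and keep the first k candidates."""
    recall_set = build_recall_set(anchor_item, catalog)
    ranked = sorted(recall_set, key=lambda pair: score_fn(*pair), reverse=True)
    return [candidate for _, candidate in ranked[:k]]


# Hypothetical scores standing in for the selected recommendation model.
toy_scores = {
    ("tent", "sleeping_bag"): 0.9,
    ("tent", "lantern"): 0.7,
    ("tent", "stove"): 0.4,
    ("tent", "novel"): 0.1,
}


def toy_score_fn(anchor, candidate):
    """Look up a made-up relevance score; unknown pairs score 0.0."""
    return toy_scores.get((anchor, candidate), 0.0)


catalog = ["tent", "sleeping_bag", "lantern", "stove", "novel"]
recommendations = top_k_recommendations("tent", catalog, toy_score_fn, k=2)
```

A production recall step would typically narrow the catalog before ranking, for example by category or co-view signals, so that the selected recommendation model scores only a manageable number of candidate pairs.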

In several embodiments, method 1400 additionally and optionally can include an activity 1445 of, when a user requests to view an anchor item, sending one or more of the recommendations associated with the anchor item to be displayed to the user. In many embodiments, the one or more recommendations can be one or more of the recommendations in the top-K rankings determined in activity 1440. In a number of embodiments, activity 1445 can be performed at least in part by communication system 311 (FIG. 3), real-time serving system 314 (FIG. 3), and/or online ranking and service component 1314 (FIG. 13).

In many embodiments, the techniques described herein can provide a practical application and several technological improvements. In some embodiments, the techniques described herein can provide for generating recommendations using adversarial counterfactual learning and evaluation. The techniques described herein can provide a significant improvement over conventional approaches that fail to account for the underlying exposure mechanism.

In a number of embodiments, the techniques described herein can solve a technical problem that arises only within the realm of computer networks, as online ordering is a concept that does not exist outside the realm of computer networks. Moreover, the techniques described herein can solve a technical problem that cannot be solved outside the context of computer networks. Specifically, the techniques described herein cannot be used outside the context of computer networks, in view of a lack of data, and the inability to train the machine-learning recommendation models without a computer.

Various embodiments can include a system including one or more processors and one or more non-transitory computer-readable media storing computing instructions that, when executed on the one or more processors, perform certain acts. The acts can include obtaining training data. The acts also can include training candidate recommendation models and an adversarial exposure model using the training data. The acts additionally can include generating recommendations based on a selected recommendation model of the candidate recommendation models.

A number of embodiments can include a method being implemented via execution of computing instructions configured to run at one or more processors. The method can include obtaining training data. The method also can include training candidate recommendation models and an adversarial exposure model using the training data. The method additionally can include generating recommendations based on a selected recommendation model of the candidate recommendation models.

Although the methods described above are with reference to the illustrated flowcharts, it will be appreciated that many other ways of performing the acts associated with the methods can be used. For example, the order of some operations may be changed, and some of the operations described may be optional.

In addition, the methods and system described herein can be at least partially embodied in the form of computer-implemented processes and apparatus for practicing those processes. The disclosed methods may also be at least partially embodied in the form of tangible, non-transitory machine-readable storage media encoded with computer program code. For example, the steps of the methods can be embodied in hardware, in executable instructions executed by a processor (e.g., software), or a combination of the two. The media may include, for example, RAMs, ROMs, CD-ROMs, DVD-ROMs, BD-ROMs, hard disk drives, flash memories, or any other non-transitory machine-readable storage medium. When the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the method. The methods may also be at least partially embodied in the form of a computer into which computer program code is loaded or executed, such that, the computer becomes a special purpose computer for practicing the methods. When implemented on a general-purpose processor, the computer program code segments configure the processor to create specific logic circuits. The methods may alternatively be at least partially embodied in application specific integrated circuits for performing the methods.

The foregoing is provided for purposes of illustrating, explaining, and describing embodiments of these disclosures. Modifications and adaptations to these embodiments will be apparent to those skilled in the art and may be made without departing from the scope or spirit of these disclosures.

Although generating recommendations using adversarial counterfactual learning and evaluation has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes may be made without departing from the spirit or scope of the disclosure. Accordingly, the disclosure of embodiments is intended to be illustrative of the scope of the disclosure and is not intended to be limiting. It is intended that the scope of the disclosure shall be limited only to the extent required by the appended claims. For example, to one of ordinary skill in the art, it will be readily apparent that any element of FIGS. 1-14 may be modified, and that the foregoing discussion of certain of these embodiments does not necessarily represent a complete description of all possible embodiments. For example, one or more of the procedures, processes, or activities of FIG. 14 may include different procedures, processes, and/or activities and be performed by many different modules, in many different orders. As another example, the systems within system 300 (FIG. 3) and/or system 1300 (FIG. 13) can be interchanged or otherwise modified.

Replacement of one or more claimed elements constitutes reconstruction and not repair. Additionally, benefits, other advantages, and solutions to problems have been described with regard to specific embodiments. The benefits, advantages, solutions to problems, and any element or elements that may cause any benefit, advantage, or solution to occur or become more pronounced, however, are not to be construed as critical, required, or essential features or elements of any or all of the claims, unless such benefits, advantages, solutions, or elements are stated in such claim.

Moreover, embodiments and limitations disclosed herein are not dedicated to the public under the doctrine of dedication if the embodiments and/or limitations: (1) are not expressly claimed in the claims; and (2) are or are potentially equivalents of express elements and/or limitations in the claims under the doctrine of equivalents. 

What is claimed is:
 1. A system comprising: one or more processors; and one or more non-transitory computer-readable media storing computing instructions that, when executed on the one or more processors, perform: obtaining training data; training candidate recommendation models and an adversarial exposure model using the training data; and generating recommendations based on a selected recommendation model of the candidate recommendation models.
 2. The system of claim 1, wherein the training data comprise personalized recommendation data and item recommendation data.
 3. The system of claim 2, wherein the personalized recommendation data comprise first records each comprising a respective user feature, a respective view sequence, a respective item feature, a respective target purchase, and a respective label.
 4. The system of claim 2, wherein the item recommendation data comprise second records each comprising a respective anchor item, a respective candidate item, a respective item feature, and a respective label.
 5. The system of claim 1, wherein the candidate recommendation models comprise a linear regression recommendation model, a neural network recommendation model, and a matrix factorization recommendation model.
 6. The system of claim 5, wherein training the candidate recommendation models and the adversarial exposure model further comprises: performing a gradient ascent descent to optimize a minimax objective for each of the candidate recommendation models and the adversarial exposure model.
 7. The system of claim 1, wherein the computing instructions, when executed on the one or more processors, further perform: performing an evaluation of the candidate recommendation models, as trained, using the adversarial exposure model, as trained.
 8. The system of claim 7, wherein the computing instructions, when executed on the one or more processors, further perform: selecting the selected recommendation model from among the candidate recommendation models based on the evaluation.
 9. The system of claim 1, wherein generating the recommendations based on the selected recommendation model further comprises: constructing a recall set of candidate recommendation pairs; generating a ranking of the candidate recommendation pairs in the recall set using the selected recommendation model; and determining the recommendations from the ranking.
 10. The system of claim 1, wherein the computing instructions, when executed on the one or more processors, further perform: when a user requests to view an anchor item, sending one or more of the recommendations associated with the anchor item to be displayed to the user.
 11. A method implemented via execution of computing instructions configured to run at one or more processors, the method comprising: obtaining training data; training candidate recommendation models and an adversarial exposure model using the training data; and generating recommendations based on a selected recommendation model of the candidate recommendation models.
 12. The method of claim 11, wherein the training data comprise personalized recommendation data and item recommendation data.
 13. The method of claim 12, wherein the personalized recommendation data comprise first records each comprising a respective user feature, a respective view sequence, a respective item feature, a respective target purchase, and a respective label.
 14. The method of claim 12, wherein the item recommendation data comprise second records each comprising a respective anchor item, a respective candidate item, a respective item feature, and a respective label.
 15. The method of claim 11, wherein the candidate recommendation models comprise a linear regression recommendation model, a neural network recommendation model, and a matrix factorization recommendation model.
 16. The method of claim 15, wherein training the candidate recommendation models and the adversarial exposure model further comprises: performing a gradient ascent descent to optimize a minimax objective for each of the candidate recommendation models and the adversarial exposure model.
 17. The method of claim 11 further comprising: performing an evaluation of the candidate recommendation models, as trained, using the adversarial exposure model, as trained.
 18. The method of claim 17 further comprising: selecting the selected recommendation model from among the candidate recommendation models based on the evaluation.
 19. The method of claim 11, wherein generating the recommendations based on the selected recommendation model further comprises: constructing a recall set of candidate recommendation pairs; generating a ranking of the candidate recommendation pairs in the recall set using the selected recommendation model; and determining the recommendations from the ranking.
 20. The method of claim 11 further comprising: when a user requests to view an anchor item, sending one or more of the recommendations associated with the anchor item to be displayed to the user. 